Against (talking about) “big data.”

Is big data the future of X? Yes, absolutely, for all X. No, forget about big data: small data is the real revolution! No, wait. Forget about big and small — what matters is long data.

Conversation about “big data” has become a hilarious game of buzzword bingo, aggravated by one of the great strengths of social media: the way conversations in one industry or field seep into another. I’ve seen humanists retweet an article by a data scientist criticizing “big data,” only to discover a week later that its author defines “small data” as anything less than a terabyte. Since the projects that humanists would call “big” usually involve less than a tenth of a terabyte, it turns out that our brutal gigantism is actually artisanal and twee.

The discussion is incoherent, but human beings like discussion, and are reluctant to abandon a lively one just because it makes no sense. One popular way to save this conversation is to propose that the “big” in “big data” may be a purely relative term. It’s “whatever is big for you.” In other words, perhaps we’re discussing a generalized expansion of scale, across all scales? For Google, “big data” might mean moving from petabytes to exabytes. For a biologist, it might mean moving from gigabytes to terabytes. For a humanist, it might mean any use of quantitative methods at all.

This solution is rhetorically appealing, but still incoherent. The problem isn’t just that we’re talking about different sizes of data. It’s that the concept of “big data” conflates trends that are located in different social contexts and that raise fundamentally different questions.

To sort things out a little, let me name a few of the different contexts involved:

1) Big IT companies are simply confronting new logistical problems. E.g., if you’re wrangling a petabyte or more, it no longer makes sense to move the data around. Instead you want to clone your algorithm and send it to the (various) machines where the data already lives.

2) But this technical sense of the word shades imperceptibly into another sense where it’s really a name for new business opportunities. The fact that commerce is now digital means that companies can get a new stream of information about consumers. This sort of market research may or may not actually require managing “big data” in sense (1). A widely-cited argument from Microsoft Research suggests that most applications of this kind involve less than 14GB and could fit into memory on a single machine.

3) Interest in these business opportunities has raised the profile of a loosely-defined field called “data science,” which might include machine learning, data mining, information retrieval, statistics, and software engineering, as well as aspects of social-scientific and humanistic analysis. When The New York Times writes that a Yale researcher has “used Big Data” to reveal X — with creepy capitalization — they’re not usually making a claim about the size of the dataset at all. They mean that some combination of tools from this toolkit was involved.

4) Social media produces new opportunities not only for corporations, but for social scientists, who now have access to a huge dataset of interactions between real, live, dubiously representative people. When academics talk about “big data,” they’re most often discussing the promise and peril of this research. Jean Burgess and Axel Bruns have focused explicitly on the challenges of research using Twitter, as have Melissa Terras, Shirley Williams, and Claire Warwick.

5) Some prominent voices (e.g., the editor-in-chief of Wired) have argued that the availability of data makes explicit theory-building less important. Most academics I know are at least slightly skeptical. The best case for this thesis might be something like machine translation, where a brute-force approach based on a big corpus of examples turns out to be more efficient than a painstakingly crafted linguistic model. Clement Levallois, Stephanie Steinmetz, and Paul Wouters have reflected thoughtfully on the implications for social science.

6) In a development that may or may not have anything to do with senses 1-5, quantitative methods have started to seem less ridiculous to humanists. Quantitative research has a long history in the humanities, from ARTFL to the Annales school to nineteenth-century philology. But it has never occupied center stage — and still doesn’t, although it is now considered worthy of debate. Since humanists usually still work with small numbers of examples, any study with n > 50 is in danger of being described as an example of “big data.”

These are six profoundly different issues. I don’t mean to deny that they’re connected: contemporaneous trends are almost always connected somehow. The emergence of the Internet is probably a causal factor in everything described above.

But we’re still talking about developments that are very different — not just because they involve different scales, but because they’re grounded in different institutions and ideas. I can understand why journalists are tempted to lump all six together with a buzzword: buzz is something that journalists can’t afford to ignore. But academics should resist taking the bait: you can’t make a cogent argument about a buzzword.

I think it’s particularly a mistake to assume that interest in scale is associated with optimism about the value of quantitative analysis. That seems to be the assumption driving a lot of debate about this buzzword, but it doesn’t have to be true at all.

To take an example close to my heart: the reason I don’t try to mine small datasets is that I’m actually very skeptical about the humanistic value of quantification. Until we get full-blown AI, I doubt that computers will add much to our interpretation of one, or five, or twenty texts. Given the boosterism surrounding “big data,” people tend to understand this hesitation as a devaluation of something called (strangely) “small data.” But the issue is really the reverse: the interpretive problems in individual works are interesting and difficult, and I don’t think digital technology provides enough leverage to crack them. In the humanities, numbers help mainly with simple problems that happen to be too large to fit in human memory.

To make a long story short: “big data” is not an imprecise-but-necessary term. It’s a journalistic buzzword with a genuinely harmful kind of incoherence. I personally avoid it, and I think even journalists should proceed with caution.

7 thoughts on “Against (talking about) ‘big data.’”

  1. A very interesting discussion. If you don’t mind my focusing on just one of the issues you raise, Levallois, Steinmetz, and Wouters were in large part responding to the reception of Savage and Burrows’s 2007 article, ‘The coming crisis of empirical sociology’ (http://soc.sagepub.com/content/41/5/885.full.pdf), which predates the current obsession with ‘big data’. Reading your article prompted me to return to that piece after having almost forgotten it. To summarise Savage and Burrows’s argument, empirical sociologists have long worked with what they considered to be large datasets, so the ‘crisis’ for them arises from the realisation that ‘the routine operations of a large capitalist institution’ may produce (as a mere ‘digital by-product’) ‘data which dwarfs anything that an academic social scientist could garner’ (p. 887). If the defining feature of a sociologist is the use of a particular set of tools to study human populations, what’s the point in being a sociologist when large corporations have access to apparently bigger and better tools for doing precisely that? Savage and Burrows’s solution is a turn to a ‘sociology [that] seeks to define itself through a concern with research methods… as… an intrinsic feature of contemporary capitalist organisation.’ (p. 896) This sociology would not be organised around the use of particular techniques for social research, in other words, but around critique of social research as carried out in ‘contemporary capitalist organisation’.

    The situation of empirical sociology, as described above, is very different from the position from which the humanities approach what humanists generally consider to be big data (as you put it, ‘[f]or a humanist, [big data] might mean any use of quantitative methods at all’). But what strikes me now is that Savage and Burrows were proposing what would seem to be a particularly ‘humanist’ form of sociology.

  2. I’ve finally had time to read the article you link by Savage and Burrows. It’s very interesting. This isn’t a debate I was aware of, but it does seem really central to sense #4 above.

    There might be a bit more of a parallel to the humanities than is initially apparent, inasmuch as a lot of large-scale text mining has depended on private-sector sources (Gale Cengage, or Google Books, or Booklamp, etc.). And in at least one respect humanists have responded as Savage and Burrows propose: we have pressed for those sources to be made public, with a fair amount of success. (Actually librarians may deserve much of the credit for the success.)

    Probably in coming years we’re going to confront challenges that are even more parallel to sociologists’ dilemma. The corporation that produces a multi-player online game is going to have the world’s best information about the history of audience and reception. We’ll need access to their market research in order to really understand the social history of these forms.

    • That’s a great point: campaigning for openness is a form of engagement with precisely these issues.

      Your point about multi-player games reminds me of something else: everyone I know who has thought of using Twitter for research purposes has ended up disappointed because the API is crippled, yet asking Twitter to improve it won’t help because it crippled the API on purpose. Twitter sees user data as one of its biggest assets, and protects it for commercial reasons. We’ll run into the same problem with reception data for games. To Blizzard Entertainment or whoever, that data is a competitive edge that would disappear if shared.

      Actually, this has just reminded me of a passage I read in an introductory media studies book, years and years ago (it might have been the one by Denis McQuail). It basically said, there are two kinds of audience research. One is done by academics, working on a relatively small scale and publishing their findings. The other is done by media corporations, working on a much larger scale – but we know virtually nothing about it, because it’s all private.
