Conversation about “big data” has become a hilarious game of buzzword bingo, aggravated by one of the great strengths of social media — the way conversations in one industry or field seep into another. I’ve seen humanists retweet an article by a data scientist criticizing “big data,” only to discover a week later that their author defines “small data” as anything less than a terabyte. Since the projects that humanists would call “big” usually involve less than a tenth of a terabyte, it turns out that our brutal gigantism is actually artisanal and twee.
The discussion is incoherent, but human beings like discussion, and are reluctant to abandon a lively one just because it makes no sense. One popular way to save this conversation is to propose that the “big” in “big data” may be a purely relative term. It’s “whatever is big for you.” In other words, perhaps we’re discussing a generalized expansion of scale, across all scales? For Google, “big data” might mean moving from petabytes to exabytes. For a biologist, it might mean moving from gigabytes to terabytes. For a humanist, it might mean any use of quantitative methods at all.
This solution is rhetorically appealing, but still incoherent. The problem isn’t just that we’re talking about different sizes of data. It’s that the concept of “big data” conflates trends located in different social contexts, that raise fundamentally different questions.
To sort things out a little, let me name a few of the different contexts involved:
1) Big IT companies are simply confronting new logistical problems. E.g., if you’re wrangling a petabyte or more, it no longer makes sense to move the data around. Instead you want to clone your algorithm and send it to the (various) machines where the data already lives.
2) But this technical sense of the word shades imperceptibly into another sense where it’s really a name for new business opportunities. The fact that commerce is now digital means that companies can get a new stream of information about consumers. This sort of market research may or may not actually require managing “big data” in sense (1). A widely-cited argument from Microsoft Research suggests that most applications of this kind involve less than 14GB and could fit into memory on a single machine.
3) Interest in these business opportunities has raised the profile of a loosely-defined field called “data science,” which might include machine learning, data mining, information retrieval, statistics, and software engineering, as well as aspects of social-scientific and humanistic analysis. When The New York Times writes that a Yale researcher has “used Big Data” to reveal X — with creepy capitalization — they’re not usually making a claim about the size of the dataset at all. They mean that some combination of tools from this toolkit was involved.
4) Social media produces new opportunities not only for corporations, but for social scientists, who now have access to a huge dataset of interactions between real, live, dubiously representative people. When academics talk about “big data,” they’re most often discussing the promise and peril of this research. Jean Burgess and Axel Bruns have focused explicitly on the challenges of research using Twitter, as have Melissa Terras, Shirley Williams, and Claire Warwick.
5) Some prominent voices (e.g., the editor-in-chief of Wired) have argued that the availability of data makes explicit theory-building less important. Most academics I know are at least slightly skeptical. The best case for this thesis might be something like machine translation, where a brute-force approach based on a big corpus of examples turns out to be more efficient than a painstakingly crafted linguistic model. Clement Levallois, Stephanie Steinmetz, and Paul Wouters have reflected thoughtfully on the implications for social science.
6) In a development that may or may not have anything to do with senses 1-5, quantitative methods have started to seem less ridiculous to humanists. Quantitative research has a long history in the humanities, from ARTFL to the Annales school to nineteenth-century philology. But it has never occupied center stage — and still doesn’t, although it is now considered worthy of debate. Since humanists usually still work with small numbers of examples, any study with n > 50 is in danger of being described as an example of “big data.”
These are six profoundly different issues. I don’t mean to deny that they’re connected: contemporaneous trends are almost always connected somehow. The emergence of the Internet is probably a causal factor in everything described above.
But we’re still talking about developments that are very different — not just because they involve different scales, but because they’re grounded in different institutions and ideas. I can understand why journalists are tempted to lump all six together with a buzzword: buzz is something that journalists can’t afford to ignore. But academics should resist taking the bait: you can’t make a cogent argument about a buzzword.
I think it’s particularly a mistake to assume that interest in scale is associated with optimism about the value of quantitative analysis. That seems to be the assumption driving a lot of debate about this buzzword, but it doesn’t have to be true at all.
To take an example close to my heart: the reason I don’t try to mine small datasets is that I’m actually very skeptical about the humanistic value of quantification. Until we get full-blown AI, I doubt that computers will add much to our interpretation of one, or five, or twenty texts. In the context of obsession with the boosterism surrounding “big data,” people tend to understand this hesitation as a devaluation of something called (strangely) “small data.” But the issue is really the reverse: the interpretive problems in individual works are interesting and difficult, and I don’t think digital technology provides enough leverage to crack them. In the humanities, numbers help mainly with simple problems that happen to be too large to fit in human memory.
To make a long story short: “big data” is not an imprecise-but-necessary term. It’s a journalistic buzzword with a genuinely harmful kind of incoherence. I personally avoid it, and I think even journalists should proceed with caution.