problems of scale

Against (talking about) “big data.”

Is big data the future of X? Yes, absolutely, for all X. No, forget about big data: small data is the real revolution! No, wait. Forget about big and small — what matters is long data.

800px-Looking_Up_at_Empire_State_BuildingConversation about “big data” has become a hilarious game of buzzword bingo, aggravated by one of the great strengths of social media — the way conversations in one industry or field seep into another. I’ve seen humanists retweet an article by a data scientist criticizing “big data,” only to discover a week later that their author defines “small data” as anything less than a terabyte. Since the projects that humanists would call “big” usually involve less than a tenth of a terabyte, it turns out that our brutal gigantism is actually artisanal and twee.

The discussion is incoherent, but human beings like discussion, and are reluctant to abandon a lively one just because it makes no sense. One popular way to save this conversation is to propose that the “big” in “big data” may be a purely relative term. It’s “whatever is big for you.” In other words, perhaps we’re discussing a generalized expansion of scale, across all scales? For Google, “big data” might mean moving from petabytes to exabytes. For a biologist, it might mean moving from gigabytes to terabytes. For a humanist, it might mean any use of quantitative methods at all.

This solution is rhetorically appealing, but still incoherent. The problem isn’t just that we’re talking about different sizes of data. It’s that the concept of “big data” conflates trends located in different social contexts, that raise fundamentally different questions.

To sort things out a little, let me name a few of the different contexts involved:

1) Big IT companies are simply confronting new logistical problems. E.g., if you’re wrangling a petabyte or more, it no longer makes sense to move the data around. Instead you want to clone your algorithm and send it to the (various) machines where the data already lives.

2) But this technical sense of the word shades imperceptibly into another sense where it’s really a name for new business opportunities. The fact that commerce is now digital means that companies can get a new stream of information about consumers. This sort of market research may or may not actually require managing “big data” in sense (1). A widely-cited argument from Microsoft Research suggests that most applications of this kind involve less than 14GB and could fit into memory on a single machine.

3) Interest in these business opportunities has raised the profile of a loosely-defined field called “data science,” which might include machine learning, data mining, information retrieval, statistics, and software engineering, as well as aspects of social-scientific and humanistic analysis. When The New York Times writes that a Yale researcher has “used Big Data” to reveal X — with creepy capitalization — they’re not usually making a claim about the size of the dataset at all. They mean that some combination of tools from this toolkit was involved.

4) Social media produces new opportunities not only for corporations, but for social scientists, who now have access to a huge dataset of interactions between real, live, dubiously representative people. When academics talk about “big data,” they’re most often discussing the promise and peril of this research. Jean Burgess and Axel Bruns have focused explicitly on the challenges of research using Twitter, as have Melissa Terras, Shirley Williams, and Claire Warwick.

5) Some prominent voices (e.g., the editor-in-chief of Wired) have argued that the availability of data makes explicit theory-building less important. Most academics I know are at least slightly skeptical. The best case for this thesis might be something like machine translation, where a brute-force approach based on a big corpus of examples turns out to be more efficient than a painstakingly crafted linguistic model. Clement Levallois, Stephanie Steinmetz, and Paul Wouters have reflected thoughtfully on the implications for social science.

6) In a development that may or may not have anything to do with senses 1-5, quantitative methods have started to seem less ridiculous to humanists. Quantitative research has a long history in the humanities, from ARTFL to the Annales school to nineteenth-century philology. But it has never occupied center stage — and still doesn’t, although it is now considered worthy of debate. Since humanists usually still work with small numbers of examples, any study with n > 50 is in danger of being described as an example of “big data.”

These are six profoundly different issues. I don’t mean to deny that they’re connected: contemporaneous trends are almost always connected somehow. The emergence of the Internet is probably a causal factor in everything described above.

But we’re still talking about developments that are very different — not just because they involve different scales, but because they’re grounded in different institutions and ideas. I can understand why journalists are tempted to lump all six together with a buzzword: buzz is something that journalists can’t afford to ignore. But academics should resist taking the bait: you can’t make a cogent argument about a buzzword.

I think it’s particularly a mistake to assume that interest in scale is associated with optimism about the value of quantitative analysis. That seems to be the assumption driving a lot of debate about this buzzword, but it doesn’t have to be true at all.

To take an example close to my heart: the reason I don’t try to mine small datasets is that I’m actually very skeptical about the humanistic value of quantification. Until we get full-blown AI, I doubt that computers will add much to our interpretation of one, or five, or twenty texts. In the context of obsession with the boosterism surrounding “big data,” people tend to understand this hesitation as a devaluation of something called (strangely) “small data.” But the issue is really the reverse: the interpretive problems in individual works are interesting and difficult, and I don’t think digital technology provides enough leverage to crack them. In the humanities, numbers help mainly with simple problems that happen to be too large to fit in human memory.

To make a long story short: “big data” is not an imprecise-but-necessary term. It’s a journalistic buzzword with a genuinely harmful kind of incoherence. I personally avoid it, and I think even journalists should proceed with caution.


A new approach to the history of character?

In Macroanalysis, Matt Jockers points out that computational stylistics has found it hard to grapple with “the aspects of writing that readers care most deeply about, namely plot, character, and theme” (118). He then proceeds to use topic modeling to pretty thoroughly anatomize theme in the nineteenth-century novel. One down, I guess, two to go!

But plot and character are probably harder than theme; it’s not yet clear how we would trace those patterns in thousands of volumes. So I think it may be worth flagging a very promising article by David Bamman, Brendan O’Connor, and Noah A. Smith. Computer scientists don’t often develop a new methodology that could seriously enrich criticism of literature and film. But this one deserves a look. (Hat tip to Lynn Cherny, by the way, for this lead.)

Emotion-Masks-760092The central insight in the article is that character can be modeled grammatically. If you can use natural language processing to parse sentences, you should be able to identify what’s being said about a given character. The authors cleverly sort “what’s being said” into three questions: what does the character do, what do they suffer or undergo, and what qualities are attributed to them? The authors accordingly model character types (or “personas”) as a set of three distributions over these different domains. For instance, the ZOMBIE persona might do a lot of “eating” and “killing,” get “killed” in turn, and find himself described as “dead.”

The authors try to identify character types of this kind in a collection of 42,306 movie plot summaries extracted from Wikipedia. The model they use is a generative one, which entails assumptions that literary critics would call “structuralist.” Movies in a given genre have a tendency to rely on certain recurring character types. Those character types in turn “generate” the specific characters in a given story, which in turn generate the actions and attributes described in the plot summary.

Using this model, they reason inward from both ends of the process. On the one hand, we know the genres that particular movies belong to. On the other hand, we can see that certain actions and attributes tend to recur together in plot summaries. Can we infer the missing link in this process — the latent character types (“personas”) that mediate the connection from genre to action?

It’s a very thoughtful model, both mathematically and critically. Does it work? Different disciplines will judge success in different ways. Computer scientists tend to want to validate a model against some kind of ground truth; in this case they test it against character patterns described by fans on TV Tropes. Film critics may be less interested in validating the model than in seeing whether it tells them anything new about character. And I think the model may actually have some new things to reveal; among other things, it suggests that the vocabulary used to describe character is strongly coded by genre. In certain genres, characters “flirt,” in others, they “switch” or “are switched.” In some genres, characters merely “defeat” each other; in other genres, they “decapitate” or “are decapitated”!

Since an association with genre is built into the generative assumptions that define the article’s model of character, this might be a predetermined result. But it also raises a hugely interesting question, and there’s lots of room for experimentation here. If the authors’ model of character is too structuralist for your taste, you’re free to sketch a different one and give it a try! Or, if you’re skeptical about our ability to fully “model” character, you could refuse to frame a generative model at all, and just use clustering algorithms in an ad hoc exploratory way to find clues.

Critics will probably also cavil about the dataset (which the authors have generously made available). Do Wikipedia plot summaries tell us about recurring character patterns in film, or do they tell us about the character patterns that are most readily recognized by editors of Wikipedia?

But I think it would be a mistake to cavil. When computer scientists hand you a new tool, the question to ask is not, “Have they used it yet to write innovative criticism?” The question to ask is, “Could we use this?” And clearly, we could.

The approach embodied in this article could be enormously valuable: it could help distant reading move beyond broad stylistic questions and start to grapple with the explicit social content of fiction (and for that matter, nonfiction, which may also rely on implicit schemas of character, as the authors shrewdly point out). Ideally, we would not only map the assumptions about character that typify a given period, but describe how those patterns have changed across time.

Making that work will not be simple: as always, the real problem is the messiness of the data. Applying this technique to actual fictive texts will be a lot harder than applying it to a plot summary. Character names are often left implicit. Many different voices speak; they’re not all equally reliable. And so on.

But the Wordseer Project at Berkeley has begun to address some of these problems. Also, it’s possible that the solution is to scale up instead of sweating the details of coreference resolution: an error rate of 20 or 30% might not matter very much, if you’re looking at strongly marked patterns in a corpus of 40,000 novels.

In any case, this seems to me an exciting lead, worthy of further exploration.

Postscript: Just to illustrate some of the questions that come up: How gendered are character types? The article by Bamman et. al. explicitly models gender as a variable, but the types it ends up identifying are less gender-segregated than I might expect. The heroes and heroines of romantic comedy, for instance, seem to be described in similar ways. Would this also be true in nineteenth-century fiction?