On not trusting people who promise “to use their powers for good.”

Data mining is troubling for some of the same reasons that social science in general is troubling. It suggests that our actions are legible from a perspective we don’t immediately possess, and reveal things we haven’t consciously chosen to reveal. This asymmetry of knowledge is unsettling even when posed abstractly as a question of privacy. It becomes more concretely worrisome when power is added to the equation. Kieran Healy has written a timely blog post showing how the network analysis that allows us to better understand Boston in the 1770s could also be used as an instrument of social control. The NSA’s programs of secret surveillance are Healy’s immediate target, but it’s not difficult to imagine that corporate data mining could be used in equally troubling ways.

Right now, for reasons of copyright law, humanists mostly mine data about the dead. But if we start teaching students how to do this, it’s very likely that some of them will end up working in corporations or in the government. So it’s reasonable to ask how we propose to deal with the political questions these methods raise.

My own view is that we should resist the temptation to say anything reassuring, because professional expertise can’t actually resolve the underlying political problem. Any reassurance academics might offer will be deceptive.

The classic form of this deception is familiar from the opening scenes of a monster movie. “Relax! I can assure you that the serum I have developed will only be used for good.”

Poster from the 1880s, courtesy Wikimedia Commons.

Of course, something Goes Horribly Wrong. But since monster movies aren’t usually made about humanists, we may not recognize ourselves in this picture. We don’t usually “promise to use our powers for good”; we strike a different tone.

For instance: “I admit that in their current form, these methods are problematic. They have the potential to reduce people to metadata in a way that would be complicit with state and corporate power. But we can’t un-invent computers or statistical analysis. So I think humanists need to be actively involved in these emerging discourses as cultural critics. We must apply our humanistic values to create a theoretical framework that will ensure new forms of knowledge get used in cautious, humane, skeptical ways.”

I suspect some version of that statement will be very popular among humanists. It strikes a tone we’re comfortable with, and it implies that there’s an urgent need for our talents. And in fact, there’s nothing wrong with articulating a critical, humanistic perspective on data mining. It’s worth a try.

But if you back up far enough — far enough that you’re standing outside the academy altogether — humanists’ claims about the restraining value of cultural critique sound a lot like “I promise only to use my powers for good.” The naive scientist says “trust me; my professional integrity will ensure that this gets used well.” The naive humanist says “trust me; my powers of skeptical critique will ensure that this gets used well.” I wouldn’t advise the public to trust either of them.

I don’t have a solution to offer, either. Just about everything human beings have invented — from long pointy sticks to mathematics to cultural critique — can be used badly. It’s entirely possible that we could screw things up in a major way, and end up in an authoritarian surveillance state. Mike Konczal suggests we’re already there. I think history has some useful guidance to offer, but ultimately, “making sure we don’t screw this up” is not a problem that can be solved by any form of professional expertise. It’s a political problem — which is to say, it’s up to all of us to solve it.

The case of Edward Snowden may be worth a moment’s thought here. I’m not in a position to decide whether he acted rightly. We don’t have all the facts yet, and even when we have them, it may turn out to be a nasty moral problem without clear answers. What is clear is that Snowden was grappling with exactly the kinds of political questions data mining will raise. He had to ask himself, not just whether the knowledge produced by the NSA was being abused today, but whether it was a kind of knowledge that might structurally invite abuse over a longer historical timeframe. To think that question through you have to know something about the ways societies can change; you have to imagine the perspectives of people outside your immediate environment, and you have to have some skepticism about the distorting effects of your own personal interest.

These are exactly the kinds of reflection that I hope the humanities foster; they have a political value that reaches well beyond data mining in particular. But Snowden’s case is especially instructive because he’s one of the 70% of Americans who don’t have a bachelor’s degree. Wherever he learned to think this way, it wasn’t from a college course in the humanities. Instead he seems to have relied on a vernacular political tradition that told him certain questions ought to be decided by “the public,” and not delegated to professional experts.

Again, I don’t know whether Snowden acted rightly. But in general, I think traditions of democratic governance are a more effective brake on abuses of knowledge than any code of professional ethics. In fact, the notion of “professional ethics” can be a bit counter-productive here since it implies that certain decisions have to be restricted to people with an appropriate sort of training or cultivation. (See Timothy Burke’s related reflections on “the covert imagination.”)

I’m not suggesting that we shouldn’t criticize abuses of statistical knowledge; on the contrary, that’s an important topic, and I expect that many good things will be written about it both by humanists and by statisticians. What I’m saying is that we shouldn’t imagine that our political responsibilities on this topic can ever be subsumed in or delegated to our professional identities. The tension between authoritarian and democratic uses of social knowledge is not a problem that can be resolved by a more chastened or enlightened methodology, or by any form of professional expertise. It requires concrete political action — which is to say, it has to be decided by all of us.

Against (talking about) “big data.”

Is big data the future of X? Yes, absolutely, for all X. No, forget about big data: small data is the real revolution! No, wait. Forget about big and small — what matters is long data.

Conversation about “big data” has become a hilarious game of buzzword bingo, aggravated by one of the great strengths of social media: the way conversations in one industry or field seep into another. I’ve seen humanists retweet an article by a data scientist criticizing “big data,” only to discover a week later that its author defines “small data” as anything less than a terabyte. Since the projects that humanists would call “big” usually involve less than a tenth of a terabyte, it turns out that our brutal gigantism is actually artisanal and twee.

The discussion is incoherent, but human beings like discussion, and are reluctant to abandon a lively one just because it makes no sense. One popular way to save this conversation is to propose that the “big” in “big data” may be a purely relative term. It’s “whatever is big for you.” In other words, perhaps we’re discussing a generalized expansion of scale, across all scales? For Google, “big data” might mean moving from petabytes to exabytes. For a biologist, it might mean moving from gigabytes to terabytes. For a humanist, it might mean any use of quantitative methods at all.

This solution is rhetorically appealing, but still incoherent. The problem isn’t just that we’re talking about different sizes of data. It’s that the concept of “big data” conflates trends that are located in different social contexts and that raise fundamentally different questions.

To sort things out a little, let me name a few of the different contexts involved:

1) Big IT companies are simply confronting new logistical problems. E.g., if you’re wrangling a petabyte or more, it no longer makes sense to move the data around. Instead you want to clone your algorithm and send it to the (various) machines where the data already lives.

2) But this technical sense of the word shades imperceptibly into another sense where it’s really a name for new business opportunities. The fact that commerce is now digital means that companies can get a new stream of information about consumers. This sort of market research may or may not actually require managing “big data” in sense (1). A widely-cited argument from Microsoft Research suggests that most applications of this kind involve less than 14GB and could fit into memory on a single machine.

3) Interest in these business opportunities has raised the profile of a loosely-defined field called “data science,” which might include machine learning, data mining, information retrieval, statistics, and software engineering, as well as aspects of social-scientific and humanistic analysis. When The New York Times writes that a Yale researcher has “used Big Data” to reveal X — with creepy capitalization — they’re not usually making a claim about the size of the dataset at all. They mean that some combination of tools from this toolkit was involved.

4) Social media produces new opportunities not only for corporations, but for social scientists, who now have access to a huge dataset of interactions between real, live, dubiously representative people. When academics talk about “big data,” they’re most often discussing the promise and peril of this research. Jean Burgess and Axel Bruns have focused explicitly on the challenges of research using Twitter, as have Melissa Terras, Shirley Williams, and Claire Warwick.

5) Some prominent voices (e.g., the editor-in-chief of Wired) have argued that the availability of data makes explicit theory-building less important. Most academics I know are at least slightly skeptical. The best case for this thesis might be something like machine translation, where a brute-force approach based on a big corpus of examples turns out to be more efficient than a painstakingly crafted linguistic model. Clement Levallois, Stephanie Steinmetz, and Paul Wouters have reflected thoughtfully on the implications for social science.

6) In a development that may or may not have anything to do with senses 1-5, quantitative methods have started to seem less ridiculous to humanists. Quantitative research has a long history in the humanities, from ARTFL to the Annales school to nineteenth-century philology. But it has never occupied center stage — and still doesn’t, although it is now considered worthy of debate. Since humanists usually still work with small numbers of examples, any study with n > 50 is in danger of being described as an example of “big data.”

These are six profoundly different issues. I don’t mean to deny that they’re connected: contemporaneous trends are almost always connected somehow. The emergence of the Internet is probably a causal factor in everything described above.

But we’re still talking about developments that are very different — not just because they involve different scales, but because they’re grounded in different institutions and ideas. I can understand why journalists are tempted to lump all six together with a buzzword: buzz is something that journalists can’t afford to ignore. But academics should resist taking the bait: you can’t make a cogent argument about a buzzword.

I think it’s particularly a mistake to assume that interest in scale is associated with optimism about the value of quantitative analysis. That seems to be the assumption driving a lot of debate about this buzzword, but it doesn’t have to be true at all.

To take an example close to my heart: the reason I don’t try to mine small datasets is that I’m actually very skeptical about the humanistic value of quantification. Until we get full-blown AI, I doubt that computers will add much to our interpretation of one, or five, or twenty texts. In the context of the boosterism surrounding “big data,” people tend to understand this hesitation as a devaluation of something called (strangely) “small data.” But the issue is really the reverse: the interpretive problems in individual works are interesting and difficult, and I don’t think digital technology provides enough leverage to crack them. In the humanities, numbers help mainly with simple problems that happen to be too large to fit in human memory.

To make a long story short: “big data” is not an imprecise-but-necessary term. It’s a journalistic buzzword with a genuinely harmful kind of incoherence. I personally avoid it, and I think even journalists should proceed with caution.

A new approach to the history of character?

In Macroanalysis, Matt Jockers points out that computational stylistics has found it hard to grapple with “the aspects of writing that readers care most deeply about, namely plot, character, and theme” (118). He then proceeds to use topic modeling to pretty thoroughly anatomize theme in the nineteenth-century novel. One down, I guess, two to go!

But plot and character are probably harder than theme; it’s not yet clear how we would trace those patterns in thousands of volumes. So I think it may be worth flagging a very promising article by David Bamman, Brendan O’Connor, and Noah A. Smith. Computer scientists don’t often develop a new methodology that could seriously enrich criticism of literature and film. But this one deserves a look. (Hat tip to Lynn Cherny, by the way, for this lead.)

The central insight in the article is that character can be modeled grammatically. If you can use natural language processing to parse sentences, you should be able to identify what’s being said about a given character. The authors cleverly sort “what’s being said” into three questions: what does the character do, what do they suffer or undergo, and what qualities are attributed to them? The authors accordingly model character types (or “personas”) as a set of three distributions over these different domains. For instance, the ZOMBIE persona might do a lot of “eating” and “killing,” get “killed” in turn, and find himself described as “dead.”

The authors try to identify character types of this kind in a collection of 42,306 movie plot summaries extracted from Wikipedia. The model they use is a generative one, which entails assumptions that literary critics would call “structuralist.” Movies in a given genre have a tendency to rely on certain recurring character types. Those character types in turn “generate” the specific characters in a given story, which in turn generate the actions and attributes described in the plot summary.

Using this model, they reason inward from both ends of the process. On the one hand, we know the genres that particular movies belong to. On the other hand, we can see that certain actions and attributes tend to recur together in plot summaries. Can we infer the missing link in this process — the latent character types (“personas”) that mediate the connection from genre to action?
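To make the generative story concrete, here is a minimal sketch of its forward direction only: genre picks a persona, and the persona picks the words attached to a character in each of the three roles. Everything in it is an assumption made for illustration: the persona names other than ZOMBIE, the word lists, the probabilities, and the helper names `draw` and `generate_character`. The article's actual model is richer than this, and the interesting work is the inference that runs in the opposite direction, from observed words back to latent personas.

```python
import random

# Hypothetical toy personas. Each persona is three distributions over words,
# one per role: what the character does (agent), what is done to them
# (patient), and how they are described (attribute).
PERSONAS = {
    "ZOMBIE": {
        "agent": {"eat": 0.6, "kill": 0.4},
        "patient": {"kill": 1.0},
        "attribute": {"dead": 1.0},
    },
    "SUITOR": {
        "agent": {"flirt": 0.7, "marry": 0.3},
        "patient": {"reject": 0.5, "kiss": 0.5},
        "attribute": {"charming": 1.0},
    },
}

# Hypothetical genre-level preferences over personas (invented numbers).
GENRE_PERSONAS = {
    "horror": {"ZOMBIE": 0.9, "SUITOR": 0.1},
    "romance": {"ZOMBIE": 0.05, "SUITOR": 0.95},
}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_character(genre, rng, n_words=3):
    """Genre -> persona -> (role, word) observations for one character."""
    persona = draw(GENRE_PERSONAS[genre], rng)
    observations = []
    for _ in range(n_words):
        role = rng.choice(["agent", "patient", "attribute"])
        observations.append((role, draw(PERSONAS[persona][role], rng)))
    return persona, observations


rng = random.Random(0)  # fixed seed so the sketch is repeatable
persona, observations = generate_character("horror", rng)
print(persona, observations)
```

In a plot summary we would see only the observations and the genre; the persona label is the hidden variable the model tries to recover.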

It’s a very thoughtful model, both mathematically and critically. Does it work? Different disciplines will judge success in different ways. Computer scientists tend to want to validate a model against some kind of ground truth; in this case they test it against character patterns described by fans on TV Tropes. Film critics may be less interested in validating the model than in seeing whether it tells them anything new about character. And I think the model may actually have some new things to reveal; among other things, it suggests that the vocabulary used to describe character is strongly coded by genre. In certain genres, characters “flirt,” in others, they “switch” or “are switched.” In some genres, characters merely “defeat” each other; in other genres, they “decapitate” or “are decapitated”!

Since an association with genre is built into the generative assumptions that define the article’s model of character, this might be a predetermined result. But it also raises a hugely interesting question, and there’s lots of room for experimentation here. If the authors’ model of character is too structuralist for your taste, you’re free to sketch a different one and give it a try! Or, if you’re skeptical about our ability to fully “model” character, you could refuse to frame a generative model at all, and just use clustering algorithms in an ad hoc exploratory way to find clues.

Critics will probably also cavil about the dataset (which the authors have generously made available). Do Wikipedia plot summaries tell us about recurring character patterns in film, or do they tell us about the character patterns that are most readily recognized by editors of Wikipedia?

But I think it would be a mistake to cavil. When computer scientists hand you a new tool, the question to ask is not, “Have they used it yet to write innovative criticism?” The question to ask is, “Could we use this?” And clearly, we could.

The approach embodied in this article could be enormously valuable: it could help distant reading move beyond broad stylistic questions and start to grapple with the explicit social content of fiction (and for that matter, nonfiction, which may also rely on implicit schemas of character, as the authors shrewdly point out). Ideally, we would not only map the assumptions about character that typify a given period, but describe how those patterns have changed across time.

Making that work will not be simple: as always, the real problem is the messiness of the data. Applying this technique to actual fictive texts will be a lot harder than applying it to a plot summary. Character names are often left implicit. Many different voices speak; they’re not all equally reliable. And so on.

But the Wordseer Project at Berkeley has begun to address some of these problems. Also, it’s possible that the solution is to scale up instead of sweating the details of coreference resolution: an error rate of 20 or 30% might not matter very much, if you’re looking at strongly marked patterns in a corpus of 40,000 novels.
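A quick simulation shows why a high error rate need not erase a strongly marked pattern. It is only a sketch under loud assumptions: the two "true" rates are invented, an error is modeled as a coin flip that replaces the real tag, and the helper name `observed_rate` is hypothetical. Real coreference errors may well be systematic rather than random, in which case this reassurance would not apply.

```python
import random


def observed_rate(true_rate, error_rate, n, rng):
    """Share of n characters tagged with a trait when some tags are noise."""
    hits = 0
    for _ in range(n):
        if rng.random() < error_rate:
            # Assumption: an erroneous tag is a fair coin flip.
            tagged = rng.random() < 0.5
        else:
            tagged = rng.random() < true_rate
        hits += tagged
    return hits / n


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_per_group = 20_000     # two groups, 40,000 simulated cases in all

# Invented "true" rates for two periods, each measured with 30% random error.
early = observed_rate(0.60, 0.30, n_per_group, rng)
late = observed_rate(0.30, 0.30, n_per_group, rng)
gap = early - late
print(f"observed gap: {gap:.3f} (true gap: 0.300)")
```

Under these assumptions the noise shrinks the gap toward zero, roughly in proportion to the error rate, but it does not hide the direction of the difference.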

In any case, this seems to me an exciting lead, worthy of further exploration.

Postscript: Just to illustrate some of the questions that come up: How gendered are character types? The article by Bamman et al. explicitly models gender as a variable, but the types it ends up identifying are less gender-segregated than I might expect. The heroes and heroines of romantic comedy, for instance, seem to be described in similar ways. Would this also be true in nineteenth-century fiction?

On trolling.

Does our fixation on the character of “the troll” obscure a deeper problem — that the Internet allows us to continuously troll ourselves?

"Troll," by Jolande RM, CC-BY-NC-ND.

Since trolls monopolize every discussion they’re involved in, it should come as no surprise that reflection on trolling itself tends to be preoccupied by the persona of the troll. Wikipedia, for instance, discusses trolling only as a subtopic in its article on “internet trolls.” This sounds straightforward enough: surely, trolling means behaving like a troll. But a more interesting question opens up if we recognize that the verb can float free of the noun — that trolling pervades contemporary discourse, and is performed by everyone.

After all, why does the New York Times write about real estate in the Hamptons, for an audience that mostly can’t afford it? Why does The Atlantic scour every corner of society for trends that prevent professional women from achieving work/life balance? Why do publications for an audience that has already entered or finished grad school run articles advising them not to go to grad school?

They’re all trolling us.

“Wait,” you say. “The way you’re using the word, trolling is just another name for targeted journalistic provocation.”

Trolling may have been perfected by journalists who hold their audience captive in a filter bubble, but trolling is older than journalism. As far as I can tell, Socrates was the first person to practice it. “Why hello there, Gorgias. I hear you’re a rhetorician. By the way, I’ve always wondered, what exactly is rhetoric?”

"Socrates," photo by Sebastià Giralt, CC-BY-NC-SA

In fact, Socrates may have been a troll in the noun sense as well, because he clearly enjoyed tormenting interlocutors. But that’s ad hominem and beside the point. I call Socratic discourse “trolling,” not because it was malicious, but because it was in principle interminable. When you first sat down with Socrates, you may have thought “I’m just going to answer this one question and then go buy some olives.” But the first question never gets answered. It always leads on to deeper puzzles, and although you may finally give up and leave, the discourse will be taken up tomorrow by some other victim.

Journalism is, similarly, designed to be interminable. There’s a thin pretense that you’re familiarizing yourself with world events in order to become an informed citizen, but if you actually stopped watching once you had enough information to act, cable news wouldn’t make money.

So I propose to define trolling, generally, as a discourse that is structurally incapable of reaching the conclusion it promises. It seems to be about some determinate object, but either that object endlessly recedes as you approach it, or the rules of the discourse guarantee that other topics can be substituted for the original one, so that a conclusion is never reached.

The Internet is trolling, elevated to Hegelian World Spirit. It’s easy to imagine that people lurk on comment threads denying climate change with endlessly shifting rationales because they are personally insincere, or because online anonymity creates a cool shady place where they can multiply. But in a deeper sense trolls are merely incarnating the structural logic of the Internet. On the Internet, discourse can continue endlessly, unconfined by ordinary social limits. On the Internet, there’s always a new interlocutor — and conversely, there’s always a new provocation, guaranteed to play on your most urgent anxieties, because you designed the filter that selected it yourself.

Of course, once we define trolling this broadly, it becomes nearly useless as a normative concept. It’s hard to locate a line of division between this sort of trolling and legitimate critical reflection. Which will be frustrating, unless you’re a post-structuralist or a troll.

Postscript: The italicized subhed was added on April 22, and wording was changed in minor ways to improve clarity.

The long history of humanistic reaction to sociology.

n+1’s recent editorial on the sociology of taste is worth reading. Whatever it gets wrong, it’s probably right about the real source of tension in the humanities* right now.

People spend a lot of time arguing about the disruptive effects of technology. But if the humanities were challenged primarily by online delivery of recorded lectures, I would sleep very well at night.

The challenge humanists are confronting springs from social rather than technological change. And n+1 is right that part of the problem involves cynicism about the model of culture that justified the study of literature and other arts in the twentieth century. For much of that century, humanists felt comfortable claiming that their disciplines conveyed a kind of cultivation that transcended mere specialized learning. You learned about literary form not because it was in itself useful, but because it transformed you in a way that gave you full possession of a collective human legacy. I have to admit that the sociology of culture has made it harder to write sentences like that last one with a straight face. “Transformation” and “possession” are too obviously metaphors for cultural distinction.

John Guillory, Cultural Capital, Chicago, 1993.

This isn’t to say that Pierre Bourdieu and John Guillory are personally responsible for our predicament. I remember reading Guillory in 1993, and Cultural Capital didn’t come as a great shock. Rather, it seemed to explain, more candidly than usual, a state of imperial unclothedness that sidelong glances had already led most of us to privately suspect.

The n+1 editorial seems weakest when it tries to inflate this recent dilemma for humanists into a broader crisis for left politics or individual agency as such. If social theory necessarily sapped individuals’ will to action, we would be in very hot water indeed! We’d have to avoid reading Marx, as well as Bourdieu. But social analysis can of course coexist with a commitment to social change, and it’s not clear that the sociology of culture has done anything to undermine that commitment. The solidarity of middle and working classes against oligarchic power may even be in better shape today than it was in 1993.

That’s a bit beside the point, however, because n+1 doesn’t seem primarily interested in politics as such. They cite a few dubiously representative examples of contemporary(ish) political(ish) debate (e.g., David Brooks on bobos). But their heart seems to be in the academy, and their real concern appears to be that sociology is undermining academic humanists’ ability to defend their own institutions forcefully, untroubled by any doubt that those institutions merely reproduce cultural distinction. At least that’s what I infer when the editors write that “the spokespeople most effectively diminished by Bourdieu’s influence turn out to be those already in the precarious position of having to articulate and transmit a language of aesthetic experience that could remain meaningful outside either a regime of status or a regime of productivity.”

But here it seems to me that the editors are conflating two conversations. On the one hand, there’s a social and institutional debate about reforming and/or defending specific academic disciplines. On the other, there’s an abstract debate about the tension between social analysis and “aesthetic experience.” The rationale for treating them as the same seems weak.

Bowie, Heroes, 45 rpm, photo by Affendaddy. CC-BY-NC-SA.

For after all, aesthetic appreciation is doing just fine these days: the sociology of culture hasn’t even dented it. I don’t find my appreciation of David Bowie, for instance, even slightly compromised when I acknowledge that he concocted a specific kind of glamour out of racial, national, gender, and class identities. A historically specific fabulousness is no less fabulous.

The social specificity of Bowie’s glam does, on the other hand, complicate the kind of rationale I could provide for requiring students to study his music. It makes it harder to invoke him as a vehicle for a general cultivation that transcends mere specialized learning. And that’s why the sociology of culture has posed a problem for the humanities: not that it undermines aesthetic discourse as such, but that it complicates claims about the social necessity of aesthetic cultivation.

This is a real dilemma that I can’t begin to resolve in a blog post; instead I’ll just gesture at recent scholarly conversation on the topic broadly construed, including articles, courses, and presentations by Rachel Buurma, James English, Andrew Goldstone, and Laura Heffernan, among others.

The one detail I’d like to add to that conversation is that the concept of “the humanities” we are now tempted to defend may have been shaped in the early twentieth century by a reaction to social science rather like the reaction n+1 is now articulating.

It has been almost completely erased from the discipline’s collective memory, but between 1895 and 1925, literary studies came rather close to becoming a social science. The University of Chicago had a “Professor of Literary Theory and Interpretation” in 1903 — and what literary theory meant, at the time, was an ambitious project to articulate general laws of historical development for literary form. At other institutions this project was often called “general literatology” or “comparative literature,” but it had little in common with contemporary comparative literature. If you go back and read H. M. Posnett’s Comparative Literature (1886), you discover a project that resembles comparative anthropology more than contemporary literary study.

This period of the discipline’s history is now largely forgotten. English professors remember Matthew Arnold; we remember the New Criticism, and we vaguely remember that there was something dusty called “philology” in between. But we probably don’t remember that Chicago had a Professorship of (anthropologically conceived) “Literary Theory” in 1903.

The reason we don’t remember is that there was intense and effective push-back against the incorporation of social sciences (including history) in the study of arts and letters. The reaction stretched from works like Norman Foerster’s American Scholar (1929) to René Wellek’s widely-reprinted Theory of Literature (1949), and it argued at times rather explicitly that social-scientific approaches to culture would reduce the prestige of the arts by undermining the authority of personal cultivation. (One might almost say that critics of this period foresaw the danger posed by Bourdieu.)

It may not be an accident that this was also the period when a concept of “the humanities” (newly identified as an alternative to social science) became institutionally central in American universities (see Geoffrey Harpham’s Humanities and the Dream of America and my related blog post).

I’ll have a little more to say about the anthropologically-ambitious literary theory of the early twentieth century in a book forthcoming this summer (Why Literary Periods Mattered, Stanford UP). I don’t expect that book will resolve contemporary tension between the humanities and social sciences, but I do want to point out that the debate has been going on for more than a hundred years, and that it has constituted the humanities as a distinct entity at least as much as it has threatened them.

Postscript: For a response to n+1 by an actual sociologist of culture, see whatisthewhat.

* Postscript two days later: I now disagree with one aspect of this post: the way its opening paragraphs talk generally about a challenge “for the humanities.” Actually, it’s not clear to me that Bourdieu et al. have posed a problem for historians. I was describing a challenge “for the study of literature and the arts,” and I ought to have said that specifically. In fact, the tendency to inflate doubts about a specific model of literary culture into a generalized “crisis in the humanities” is part of what’s wrong with the n+1 editorial, and part of what I ought to be taking aim at here. But I guess blogging is about learning in public.

Distant reading and representativeness.

Digital collections are vastly expanding literary scholars’ field of view: instead of describing a few hundred well-known novels, we can now test our claims against corpora that include tens of thousands of works. But because this expansion of scope has also raised expectations, the question of representativeness is often discussed as if it were a weakness rather than a strength of digital methods. How can we ever produce a corpus complete and balanced enough to represent print culture accurately?

I think the question is wrongly posed, and I’d like to suggest an alternate frame. As I see it, the advantage of digital methods is that we never need to decide on a single model of representation. We can and should keep enlarging digital collections, to make them as inclusive as possible. But no matter how large our collections become, the logic of representation itself will always remain open to debate. For instance, men published more books than women in the eighteenth century. Would a corpus be correctly balanced if it reproduced those disproportions? Or would a better model of representation try to capture the demographic reality that there were roughly as many women as men? There’s something to be said for both views.

To take another example, Scott Weingart has pointed out that there’s a basic tension in text mining between measuring “what was written” and “what was read.” A corpus that contains one record for every title, dated to its year of first publication, would tend to emphasize “what was written.” Measuring “what was read” is harder: a perfect solution would require sales figures, reviews, and other kinds of evidence. But, as a quick stab at the problem, we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.

We’ll never create a single collection that perfectly balances all these considerations. But fortunately, we don’t need to: there’s nothing to prevent us from framing our inquiry instead as a comparative exploration of many different corpora balanced in different ways.

For instance, if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.
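To make that comparison concrete, here is a minimal sketch of how one set of volume-level records could yield both kinds of collection. The records, the function names `first_editions` and `weight_by_title`, and the assumption that a title string reliably identifies a work are all invented for illustration; real library metadata is far messier.

```python
from collections import Counter

# Hypothetical volume-level records: (title, year printed).
# Invented for illustration, not drawn from any real catalog.
volumes = [
    ("Robinson Crusoe", 1719),
    ("Robinson Crusoe", 1790),
    ("Robinson Crusoe", 1815),
    ("An Obscure Novel", 1790),
]

def first_editions(records):
    """One record per title, dated to its earliest printing ("what was written")."""
    earliest = {}
    for title, year in records:
        if title not in earliest or year < earliest[title]:
            earliest[title] = year
    return sorted(earliest.items(), key=lambda item: item[1])

def weight_by_title(records):
    """One count per volume, so reprints raise a title's weight ("what was printed")."""
    return Counter(title for title, _ in records)

written = first_editions(volumes)
printed = weight_by_title(volumes)
```

Running the same analysis over `written` and over the full `volumes` list is the comparison described above: if the answers agree, popularity probably doesn't matter much for that question.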

I suspect in many cases we’ll find that it makes little difference. For instance, in tracing the development of literary language, I got interested in the relative prominence of words that entered English before and after the Norman Conquest — and more specifically, in how that ratio changed over time in different genres. My first approach to this problem was based on a collection of 4,275 volumes that were, for the most part, limited to first editions (773 of these were prose fiction).
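For readers who want to see the shape of such a measurement, the following is a rough sketch of a yearly ratio, not the procedure actually used for the collections discussed here. The `ENTERED` lookup is a toy stand-in for a dated lexicon, the labels "pre" and "post" are hypothetical, and the tokenization is deliberately naive.

```python
from collections import defaultdict

# Toy etymology lookup, invented for illustration. A real study would need
# a dated lexicon; words missing from the lookup are simply skipped.
ENTERED = {"house": "pre", "stone": "pre", "water": "pre",
           "mansion": "post", "liberty": "post"}

def yearly_ratios(volumes):
    """volumes: iterable of (year, text). Returns {year: pre/post ratio}."""
    totals = defaultdict(lambda: {"pre": 0, "post": 0})
    for year, text in volumes:
        for token in text.lower().split():
            # Strip common punctuation before looking the word up.
            era = ENTERED.get(token.strip(".,;:!?\"'"))
            if era:
                totals[year][era] += 1
    # Years with no "post" words are dropped to avoid dividing by zero.
    return {year: c["pre"] / c["post"]
            for year, c in sorted(totals.items()) if c["post"] > 0}

sample = [
    (1750, "The mansion stood for liberty."),
    (1750, "A house of stone."),
    (1850, "Water ran by the house; the mansion was gone."),
]
ratios = yearly_ratios(sample)
```

Because counts are pooled by year before dividing, a year with many volumes contributes one point to the plot, just as a year with a single volume does.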

But I recognized that other scholars would have questions about the representativeness of my sample. So I spent the last year wrestling with 470,000 volumes from HathiTrust, correcting their OCR and using classification algorithms to separate fiction from the rest of the collection. This produced a collection with a fundamentally different structure, one where a popular work of fiction could be represented by dozens or scores of reprints scattered across the timeline. What difference did that make to the result?

The same question posed to two different collections. 773 hand-selected first editions on the left; on the right, 47,549 volumes, including many translations and reprints. Yearly ratios are plotted rather than individual works.


It made almost no difference. The scatterplots look different, of course, because the hand-selected collection (on the left) is relatively stable in size across the timespan, and has a consistent kind of noisiness, whereas the HathiTrust collection (on the right) gets so huge in the nineteenth century that noise almost disappears. But the trend lines are broadly comparable, although the collections were created in completely different ways and rely on incompatible theories of representation.

I don’t regret the year I spent getting a binocular perspective on this question. Although in this case changing the corpus made little difference to the result, I’m sure there are other questions where it will make a difference. And we’ll want to consider as many different models of representation as we can. I’ve been gathering metadata about gender, for instance, so that I can ask what difference gender makes to a given question; I’d also like to have metadata about the ethnicity and national origin of authors.

But the broader point I want to make here is that people pursuing digital research don’t need to agree on a theory of representation in order to cooperate.

If you’re designing a shared syllabus or co-editing an anthology, I suppose you do need to agree in advance about the kind of representativeness you’re aiming to produce. Space is limited; tradeoffs have to be made; you can only select one set of works.

But in digital research, there’s no reason why we should ever have to make up our minds about a model of representativeness, let alone reach consensus. The number of works we can select for discussion is not limited. So we don’t need to imagine that we’re seeking a correspondence between the reality of the past and any set of works. Instead, we can look at the past from many different angles and ask how it’s transformed by different perspectives. We can look at all the digitized volumes we have — and then at a subset of works that were widely reprinted — and then at another subset of works published in India — and then at three or four works selected as case studies for close reading. These different approaches will produce different pictures of the past, to be sure. But nothing compels us to make a final choice among them.

Wordcounts are amazing.

People new to text mining are often disillusioned when they figure out how it’s actually done, which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language, but think “surely it’s something better than that?”

Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” on which to apply them. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.
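As a reminder of how little machinery is involved, here is a minimal sketch of counting words and two-word phrases using only Python's standard library. The helper names `tokenize` and `ngram_counts` are hypothetical, and the tokenizer is an assumption of convenience: it keeps only unaccented letters and straight apostrophes and ignores sentence boundaries.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (straight apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n=1):
    """Count every run of n consecutive tokens; keys are tuples of length n."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = tokenize("The cat sat. The cat slept.")
words = ngram_counts(tokens, 1)    # single words
phrases = ngram_counts(tokens, 2)  # two-word phrases
```

Note that this version lets phrases run across sentence breaks, so "sat the" gets counted as a phrase. Whether that matters depends on the question being asked.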

I suspect the core problem is that most of us learned language a long time ago, and have forgotten how much leverage it provides. We can still recognize that syntax might be worthy of analysis — because it’s elusive enough to be interesting. But the basic phenomenon of the “word” seems embarrassingly crude.

Billy Graham, 1949, from the Galt Museum, on Creative Commons.

Baby, 1949, from the Galt Museum, on Creative Commons.

We need to remember that words are actually features of a very, very high-level kind. As a thought experiment, I find it useful to compare text mining to image processing. Take the picture on the right. It’s pretty hard to teach a computer to recognize that this is a picture that contains a face. To recognize that it contains “sitting” and a “baby” would be extraordinarily impressive. And it’s probably, at present, impossible to figure out that it contains a “blanket.”

Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”

Natural language processing is going to be important for all kinds of reasons — among them, it can eventually tell us which clauses are subjunctive (should we wish to know). But I think it’s a mistake to imagine that text mining is now in a sort of crude infancy, whose real possibilities will only be revealed after NLP matures. Wordcounts are amazing! An enormous amount of our cultural history is already tagged, in a detailed way that is also easy to analyze statistically. That’s not an embarrassingly babyish method: it’s a huge and obvious research opportunity.