A brief outburst about numbers.

In responding to Stanley Fish last week, I tried to acknowledge that the “digital humanities,” in spite of their name, are not centrally about numbers. The movement is very broad, and at the broadest level, it probably has more to do with networked communication than it does with quantitative analysis.

The older tradition of “humanities computing” — which was about numbers — has been absorbed into this larger movement. But it’s definitely the part of DH that humanists are least comfortable with, and it often has to apologize for itself. So, for instance, I’ve spent much of the last year reminding humanists that they’re already using quantitative text mining in the form of search engines — so it can’t be that scary.* Kathleen Fitzpatrick recently wrote a post suggesting that “one key role for a ‘worldly’ digital humanities may well be helping to break contemporary US culture of its unthinking association of numbers with verifiable reality….” Stephen Ramsay’s Reading Machines manages to call for an “algorithmic criticism” while at the same time suggesting that humanists will use numbers in ways that are altogether different from the way scientists use them (or at least different from “scientism,” an admittedly ambiguous term).

I think all three of us (Stephen, Kathleen, and myself) are making strategically necessary moves. Because if you tell humanists that we do (also) need to use numbers the way scientists use them, your colleagues are going to mutter about naïve quests for certainty, shake their heads, and stop listening. So digital humanists are rhetorically required to construct positivist scapegoats who get hypothetically chased from our villages before we can tell people about the exciting new kinds of analysis that are becoming possible. And, to be clear, I think the people I’ve cited (including me) are doing that in fair and responsible ways.

However, I’m in an “eppur si muove” mood this morning, so I’m going to forget strategy for a second and call things the way I see them. <Begin Galilean outburst>

In reality, scientists are not naïve about the relationship between numbers and certainty, because they spend a lot of time thinking about statistics. Statistics is the science of uncertainty, and it insists, as forcefully as any literary theorist could, that every claim comes accompanied by a specific kind of ignorance. Once you accept that, you can stop looking for absolute knowledge, and instead reason concretely about your own relative uncertainty in a given instance. I think humanists’ unfamiliarity with this idea may explain why our critiques of data mining so often take the form of pointing to a small error buried somewhere in the data: unfamiliarity with statistics forces us to fall back on a black-and-white model of truth, where the introduction of any uncertainty vitiates everything.

Moreover, the branch of statistics most relevant to text mining (Bayesian inference) is amazingly, almost bizarrely willing to incorporate subjective belief into its definition of knowledge. It insists that definitions of probability have to depend not only on observed evidence, but on the “prior probabilities” that we expected before we saw the evidence. If humanists were more familiar with Bayesian statistics, I think it would blow a lot of minds.
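To make the role of "prior probabilities" concrete, here is a minimal sketch in Python of one textbook case of Bayesian updating, the Beta-Binomial model. The scenario (a word appearing in 3 of 10 sampled documents) and the two priors are invented for illustration. They are assumptions, not anything drawn from a real corpus.

```python
from fractions import Fraction

def posterior_mean(prior_a, prior_b, hits, misses):
    """Mean of the Beta posterior after observing binomial data.

    A Beta(prior_a, prior_b) prior, updated with `hits` successes and
    `misses` failures, yields a Beta(prior_a + hits, prior_b + misses)
    posterior, whose mean is a / (a + b).
    """
    a = prior_a + hits
    b = prior_b + misses
    return Fraction(a, a + b)

# Hypothetical evidence: a word appears in 3 of 10 sampled documents.
hits, misses = 3, 7

# Two readers with different prior expectations (both invented).
skeptic = posterior_mean(1, 9, hits, misses)  # expected the word to be rare
neutral = posterior_mean(1, 1, hits, misses)  # no strong expectation either way

print(float(skeptic))  # 0.2
print(float(neutral))  # roughly 0.333
```

Both readers saw exactly the same evidence, yet they end up with different estimates because they started from different expectations. That dependence on a stated prior is the sense in which Bayesian inference builds belief into its definition of probability.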

I know the line about “lies, damn lies, and so on,” and it’s certainly true that statistics can be abused, as this classic xkcd comic shows. But everything can be abused. The remedy for bad verbal argument is not to “remember that speech should stay in its proper sphere” — it’s to speak better and more critically. Similarly, the remedy for bad quantitative argument is not “remember that numbers have to stay in their proper sphere”; it’s to learn statistics and reason more critically.

possible shapes of the Beta distribution, from Wikipedia


None of this is to say that we can simply borrow tools or methods from scientists unchanged. The humanities have a lot to add — especially when it comes to the social and historical character of human behavior. I think there are fascinating advances taking place in data science right now. But when you take apart the analytic tools that computer scientists have designed, you often find that they’re based on specific mistaken assumptions about the social character of language. For instance, there’s a method called “Topics over Time” that I want to use to identify trends in the written record (Wang and McCallum, 2006). The people who designed it have done really impressive work. But if a humanist takes apart the algorithm underlying this method, they will find that it assumes that every trend can be characterized as a smooth curve called a “Beta distribution.” Whereas in fact, humanists have evidence that the historical trajectory of a topic is often more complex than that, in ways that really matter. So before I can use this tool, I’m going to have to fix that part of the method.
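The constraint can be sketched in a few lines of Python. This does not reproduce the Topics over Time model. It only examines the shape of the Beta density, with time rescaled to the interval between 0 and 1. The parameter pairs and the two-surge trajectory are arbitrary choices made for illustration.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def count_interior_peaks(values):
    """Count points that are strictly higher than both of their neighbors."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )

# Rescaled time axis, avoiding the endpoints 0 and 1.
grid = [i / 200 for i in range(1, 200)]

# A few illustrative parameter pairs (arbitrary, not taken from the paper).
shapes = [(0.5, 0.5), (1, 3), (2, 2), (2, 5), (5, 1), (8, 3)]
beta_peaks = {
    ab: count_interior_peaks([beta_pdf(x, *ab) for x in grid]) for ab in shapes
}

# A made-up topic that surges in two separate periods.
two_surges = [
    math.exp(-((x - 0.25) / 0.08) ** 2) + math.exp(-((x - 0.75) / 0.08) ** 2)
    for x in grid
]

print(beta_peaks)                        # peak count per parameter pair
print(count_interior_peaks(two_surges))  # peak count for the invented topic
```

Whatever parameters are chosen, a Beta density rises to at most one interior peak (or is monotone, or U-shaped), whereas the invented trajectory has two. A topic that flourishes, fades, and returns cannot be captured by a single curve of this family.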

The diachronic behavior a topic can actually exhibit.


But this is a problem that can be fixed, in large part, by fixing the numbers. Humanists have a real contribution to make to the science of data mining, but it’s a contribution that can be embodied in specific analytic insights: it’s not just to hover over the field like the ghost of Ben Kenobi and warn it about hubris.

</Galilean outburst>

For related thoughts, somewhat more temperate than the outburst above, see this excellent comment by Matthew Wilkens, responding to a critique of his work by Jeremy Rosen.

* I credit Ben Schmidt for this insight so often that regular readers are probably bored. But for the record: it comes from him.

10 thoughts on “A brief outburst about numbers.”

  1. I think you’re spot on about the relatively widespread lack of understanding among humanists of what science actually involves. In my experience, it’s humanists of a very particular stripe alone who maintain the obsessive fear that someone, somewhere believes numbers don’t require interpretation.

    [pontification] It would be nice if people got a look at working science as part of their education, but that’s probably impossible (labs don’t count, since they’re contrived exercises with known right answers). Failing that, a course in science studies would be a nice thing for scientists and non-scientists alike, and some statistical literacy should probably be part of most everyone’s formation. [/pontification]

  2. I agree. The thing I don’t admit in this post is that “statistical literacy” is also hard. I was a science-y kid growing up in a science-y family … but it’s only in the last year (in my mid-40s) that I’ve actually come to understand fallacies like the one dramatized in that xkcd strip on “significance.” Let alone the Bayesian stuff, which still feels to me like banging my head repeatedly into a wall (albeit a crumbly wall where you can make slow progress that way). Even when humanists get over our suspicion of numbers — and I do agree that we’re getting over it — there are going to be some hurdles here. I fear that data-oriented approaches may remain a small subfield for that reason, as much as anything else …

  3. I think the frequent mea culpa about not being positivists remains an important part of how we talk about numbers. Historical proximity to cliometrics and various other “science of history” kinds of thinking is just going to keep requiring this of us. With that said, you are completely right that things like Bayesianism and, at this point, any coherent philosophy of science have become significantly more open to the kinds of knowledge claims that humanists actually do want to make.

    With that said, an important caveat is that if you just start signing up for statistics courses, you are going to end up getting a rundown on using t-tests and ANOVAs as tools for hypothesis testing. The entire hypothesis-testing idea remains a core part of how a lot of folks in the social sciences think about things, and it is deeply at odds with what humanists want to do. As a piece of evidence, in a lot of social science fields p-values remain the currency of the value of a study. This is where your point about the humanities themselves having useful things to bring to work in statistics is particularly interesting.

    I think the time is very ripe for an intro to stats book targeted at folks in the humanities. One that would work through ways of using statistics that are particularly meaningful for the kind of work that humanists do. A cookbook approach would be great (this test is great when dealing with change over time but be aware of these kinds of assumptions sort of thing) but in particular, there remains a need for a translation from the language of hypothesis testing into the kind of argument and evidence and qualification of evidence model that humanities work from.

    In terms of epistemology, a lot of the back and forth around the connections between quantitative and qualitative research provides some interesting fodder for discussion. For example, Joe Maxwell’s work tends to be very valuable.

    See
    http://www.aera.net/uploadedfiles/journals_and_publications/journals/educational_researcher/volume_33_no_2/2026-02_maxwell.pdf

    and

    http://www.sagepub.com/books/Book226134

    • Thanks. That article by Maxwell is interesting stuff, and I think you’re especially right that we’re going to need a way of translating “hypothesis-testing” into another language.

      I suspect that humanistic arguments will rarely rest primarily on quantitative evidence. It’s going to be useful first of all as a discovery heuristic — helping us locate interesting problems. And then, toward the end of the interpretive process, I suspect it will come in as corroboration — helping to confirm some aspect of our interpretation, perhaps in a minimal way (e.g., “I can say at least that this evidence isn’t inconsistent with my interpretation,” or “I can say at least that this evidence rules out alternative explanation Y.”)

      That’s pretty minimal, but it would still be a big change from our current approach.

      • Agreed. I might add that we already maybe do more hypothesis testing than we let on. If I make a claim about, say, the effects of colonial exploitation on the forms of anglophone fiction, and then I go read some (more) of that fiction looking to see how my claims hold up, it seems to me that it’s reasonable to describe what I’m doing as testing a hypothesis. It’s usually not statistically or scientifically *strict* hypothesis testing, but that’s often OK and in any case doesn’t disqualify my project as one that has a hypothesis and puts it to a test.
        That said, I certainly agree that we don’t often frame our work that way and may not get very far with our colleagues by trying to convince them that this is what we’re (all) up to.

  4. This is in reply to Matt’s 4:40 pm comment in case my lame blog doesn’t place it correctly. I just wanted to say that you’re right, and that one important advantage of thinking in terms of hypothesis-testing is that it would force us to be self-conscious about confirmation/selection bias.

    Right now, the way the game is played, I can frame a literary hypothesis, then go to a full-text search engine and keep searching and searching until I find enough confirming evidence to write an article. The written record is big enough that you almost always will … eventually. I have to admit, I worked that way myself in the late 90s. It’s a data-mining system. It just happens to be a system that’s guaranteed to produce gigantic confirmation bias, for basically the reason illustrated in that xkcd comic.

  5. What do you think about explaining the Bayesian perspective in some detail? To me it seems like it may have the best shot of upsetting preconceptions long enough that folks will actually listen.

    Perhaps we could add Bruno de Finetti’s “Probability does not exist” to the standard quotation of Box’s “All models are wrong…”

    I *think* David Spiegelhalter offered some commentary on the quote in last week’s More or Less: http://www.bbc.co.uk/iplayer/episode/b018gzqx/More_or_Less_30_12_2011/

    • Thanks, Allen. I found the Spiegelhalter bit at about 10:50 in that segment, and it’s Bayesian all right.

      I realized in the process of writing this post that there’s probably an important book or article to be written for humanists on Bayesian statistics. I’m not sure I’m the guy to write it, though … I’m still trying to understand some of the notation! One thing I think I can see is that the Bayesian perspective does a good job of explaining why statistics are relevant even to disciplines that write about the past. “Probability” doesn’t have to be about repeatable experiments … apparently it can be about the contours of our belief or ignorance in a given domain. But I think I’ll stop right there before I get out of my depth.

