How not to do things with words.

In recent weeks, journals published two papers purporting to draw broad cultural inferences from Google’s ngram corpus. The first of these papers, in PLoS One, argued that “language in American books has become increasingly focused on the self and uniqueness” since 1960. The second, in The Journal of Positive Psychology, argued that “moral ideals and virtues have largely waned from the public conversation” in twentieth-century America. Both articles received substantial attention from journalists and blogs; both have been discussed skeptically by linguists and digital humanists. (Mark Liberman’s takes on Language Log are particularly worth reading.)

I’m writing this post because systems of academic review and communication are failing us in cases like this, and we need to step up our game. Tools like Google’s ngram viewer have created new opportunities, but also new methodological pitfalls. Humanists are aware of those pitfalls, but I think we need to work a bit harder to get the word out to journalists, and to disciplines like psychology.

The basic methodological problem in both articles is that researchers have used present-day patterns of association to define a wordlist that they then take as an index of the fortunes of some concept (morality, individualism, etc.) over historical time. (In the second study, for instance, words associated with morality were extracted from a thesaurus and crowdsourced using Mechanical Turk.)

The fallacy involved here has little to do with hot-button issues of quantification. A basic premise of historicism is that human experience gets divided up in different ways in different eras. If we crowdsource “leadership” using twenty-first-century reactions on Mechanical Turk, for instance, we’ll probably get words like “visionary” and “professional.” “Loud-voiced” probably won’t be on the list — because that’s just rude. But to Homer, there’s nothing especially noble about working for hire (“professionally”), whereas “the loud-voiced Achilles” is cut out to be a leader of men, since he can be heard over the din of spears beating on shields (Blackwell).

The laws of perspective apply to history as well. We don’t have an objective overview; we have a position in time that produces its own kind of distortion and foreshortening. Photo 2004 by June Ruivivar.

The authors of both articles are dimly aware of this problem, but they imagine that it’s something they can dismiss if they’re just conscientious and careful to choose a good list of words. I don’t blame them; they’re not coming from historical disciplines. But one of the things you learn by working in a historical discipline is that our perspective is often limited by history in ways we are unable to anticipate. So if you want to understand what morality meant in 1900, you have to work to reconstruct that concept; it is not going to be intuitively accessible to you, and it cannot be crowdsourced.

The classic way to reconstruct concepts from the past involves immersing yourself in sources from the period. That’s probably still the best way, but where language is concerned, there are also quantitative techniques that can help. For instance, Ryan Heuser and Long Le-Khac have carried out research on word frequency in the nineteenth-century novel that might superficially look like the psychological articles I am critiquing. (It’s Pamphlet 4 in the Stanford Literary Lab series.) But their work is much more reliable and more interesting, because it begins by mining patterns of association from the period in question. They don’t start from an abstract concept like “individualism” and pick words that might be associated with it. Instead, they find groups of words that are associated with each other, in practice, in nineteenth-century novels, and then trace the history of those groups. In doing so, they find some intriguing patterns that scholars of the nineteenth-century novel are going to need to pay attention to.
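To make the difference in starting point concrete, here is a toy sketch in Python. It is not Heuser and Le-Khac's actual procedure: the three "period documents," the stopword list, the function name `cooccurring_words`, and the choice of simple document-level co-occurrence are all assumptions invented for illustration. The only point is that the word group comes out of the period's own texts instead of a present-day list.

```python
from collections import Counter

# Hypothetical stand-ins for period documents (invented for illustration).
period_docs = [
    "the loud voiced captain led the men with honour",
    "a loud voiced leader of men won honour in battle",
    "the clerk worked for hire and kept the ledger",
]

# A tiny, ad hoc stopword list; a real study would need a better one.
STOPWORDS = {"the", "a", "of", "in", "and", "for", "with"}

def cooccurring_words(docs, seed, min_docs=2):
    """Return words that appear alongside `seed` in at least `min_docs` documents."""
    counts = Counter()
    for doc in docs:
        words = set(doc.split()) - STOPWORDS
        if seed in words:
            # Count each companion word once per document containing the seed.
            counts.update(words - {seed})
    return sorted(w for w, n in counts.items() if n >= min_docs)

group = cooccurring_words(period_docs, "honour")
print(group)
```

In this invented corpus the companions of "honour" are the words that the period's documents actually put next to it, whatever a modern reader might have guessed. A group like that, once found, is what you would then trace over time.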

It’s also relevant that Heuser and Le-Khac are working in a corpus that is limited to fiction. One of the problems with the Google ngram corpus is that really we have no idea what genres are represented in it, or how their relative proportions may vary over time. So it’s possible that an apparent decline in the frequency of words for moral values is actually a decline in the frequency of certain genres — say, conduct books, or hagiographic biographies. A decline of that sort would still be telling us something about literary culture; but it might be telling us something different than we initially assume from tracing the decline of a word like “fidelity.”
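A small worked example may make the genre problem easier to see. The numbers below are invented, and the genre labels and the function name `overall_rate` are placeholders. The sketch holds a word's frequency constant inside each genre and changes only the mix of genres in the corpus.

```python
# Invented numbers: occurrences of one word per million words, by genre.
# The rate inside each genre is the same in both years.
rate_per_million = {"conduct books": 120.0, "fiction": 20.0}

# Invented share of the corpus that each genre makes up in two years.
genre_share = {
    1900: {"conduct books": 0.30, "fiction": 0.70},
    2000: {"conduct books": 0.05, "fiction": 0.95},
}

def overall_rate(shares, rates):
    """Corpus-wide rate as the share-weighted average of the genre rates."""
    return sum(shares[g] * rates[g] for g in shares)

for year, shares in genre_share.items():
    print(year, round(overall_rate(shares, rate_per_million), 1))
```

Under these assumptions the corpus-wide frequency falls by half even though no genre uses the word any less. A viewer that reports only the aggregate cannot distinguish that case from a real change in usage.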

So please, if you know a psychologist, or journalist, or someone who blogs for The Atlantic: let them know that there is actually an emerging interdisciplinary field developing a methodology to grapple with this sort of evidence. Articles that purport to draw historical conclusions from language need to demonstrate that they have thought about the problems involved. That will require thinking about math, but it also, definitely, requires thinking about dilemmas of historical interpretation.

My illustration about “loud-voiced Achilles” is a very old example of the way concepts change over time, drawn via Friedrich Meinecke from Thomas Blackwell, An Enquiry into the Life and Writings of Homer, 1735. The word “professional,” by the way, also illustrates a kind of subtly moralized contemporary vocabulary that Kesebir & Kesebir may be ignoring in their account of the decline of moral virtue. One of the other dilemmas of historical perspective is that we’re in our own blind spot.

Big but not distant.

Big data. I’m tempted to begin “I, too, dislike it,” because the phrase has become a buzzword. To mainstream humanists, it sounds like a perversion. Even people who work in digital humanities protest that DH shouldn’t be normatively identified with big data — and I agree — so generally I keep quiet on the whole vexed question.

Except … there are a lot of grad students out there just starting to look at DH curiously, wondering whether it offers anything useful for their own subfield. In that situation, it’s natural to start by building a small collection that addresses a specific research problem you know about. And that might, in many cases, be a fine approach! But my conscience is nagging at me, because I can see some other, less obvious opportunities that students ought to be informed about.

It’s true that DH doesn’t have to be identified with scale. But the fact remains that problems of scale constitute a huge blind spot for individual researchers, and also define a problem that we know computers can help us explore. And when you first go into an area that was a blind spot for earlier generations of scholars, you’re almost guaranteed to find research opportunities — lying out on the ground like lumps of gold you don’t have to mine.

I'm just saying.

This suggests that it might be a mistake to assume that the most cost-effective way to get started in DH is to define a small collection focused on a particular problem you know about. It might actually be a better strategy to beg, borrow, or steal a large collection — and poke around in it for problems we don’t yet know about.

“But I’m not interested in big statistical generalizations; I care about describing individual works, decades, and social problems.” I understand; that’s a valid goal; but it’s not incompatible with the approach I’m recommending. I think it’s really vital that we do a better job of distinguishing “big data” (the resource) from “distant reading” (a particular interpretive strategy).* Big data doesn’t have to produce distant generalizations; we can use the leverage provided by scale and comparative analysis to crack open small and tightly-focused questions.

I don’t think most humanists have an intuitive grasp of how that “leverage” would work — but topic modeling is a good example. As I play around with topic-modeling large collections, I’m often finding that the process tells me interesting things about particular periods, genres, or works, by revealing how they differ from other relevant points of comparison. Topic modeling doesn’t use scale to identify a “trend” or an “average,” after all; what it does is identify the most salient dimensions of difference in a given collection. If you believe that the significance of a text is defined by its relation to context, then you can see how topic modeling a collection might help us crack open the (relational) significance of individual works.

“But how do we get our hands on the data?” Indeed: there’s the rub. Miriam Posner has recently suggested that the culture surrounding “coding” serves as a barrier that discourages women and minorities from entering certain precincts of DH. I think that’s right, but I’m even more concerned about the barriers embodied in access to data. Coding is actually not all that hard to pick up. Yes, it’s surrounded by gendered assumptions; but still, you can do it over a summer. [Update: Or, where that’s not practical, you can collaborate with someone. At Illinois, Loretta Auvil and Boris Capitanu do kinds of DH programming that are beyond me. I don’t mean to minimize issues of gender here, but I do mean to put “coding” in perspective. It’s not a mysterious, magical key.] By contrast, none of us can build big data on our own (or even in small teams) over the summer. If we don’t watch out, our field could easily slip into a situation where power gravitates to established scholars at large/wealthy research universities.

I’ve tried to address that by making my own data public. I haven’t documented it very well yet, but give me a few weeks. I think peer pressure should be exerted on everyone (especially established scholars) to make their data public at the time of publication. I do understand that some kinds of data can’t be shared because they’re owned by private enterprise. I accept that. But if you’ve supplemented proprietary data with other things you’ve produced on your own, then in my opinion that supplementary data should be made public at the time of publication.

Moreover, if you do that, I’m not going to care very much about the mistakes you have made in building your collection. I may think your data is completely biased and unrepresentative, because it includes too much Y and not enough X. But if so, I have an easy solution — which is to take your data, add it to my own collection of X, and other data borrowed from Initiative Z, and then select whatever subset would in my opinion create a balanced and representative collection. Then I can publish my own article correcting your initial, biased result.

Humanists are used to approaching debates about historical representation as if they were zero-sum questions. I suppose we are on some level still imagining this as a debate about canonicity — which is, as John Guillory pointed out, really a debate about space on the syllabus. Space on the syllabus is a zero-sum game. But the process of building big data is not zero-sum; it is cumulative. Every single thing you digitize is more good news for me, even if I shudder at the tired 2007-vintage assumptions implicit in your research agenda.

Personally, I feel the same way about questions of markup and interoperability. It’s all good. If you can give me clean** ASCII text files with minimal metadata, I love you. If you can give me TEI with enriched metadata, I love you. I don’t want to waste a lot of breath arguing about which standard is better. In most cases, clean ASCII text would be a lot better than what I can currently get.

* I hasten to say that I’m using “distant reading” here as the phrase is commonly deployed in debate — not as Franco Moretti originally used it — because the limitation I’m playing on is not really present in Moretti’s own use of the term. Moretti pointedly emphasizes that the advantage of a distant perspective may be to reveal the relational significance of an individual work.

** And, when I say “clean” — I will definitely settle for a 5% error rate.

Guillory, John. Cultural Capital. Chicago: U. of Chicago Press, 1993.
Moretti, Franco. Graphs, Maps, Trees. New York: Verso, 2005.

[UPDATE: For a different perspective on the question of representativeness, see Katherine D. Harris on Big Data, DH, and Gender. Also, see Roger Whitson, who suggests that linked open data may help us address issues of representation.]

Do humanists get their ideas from anything at all?

My reaction to Stanley Fish’s third column on digital humanities was at first so negative that I thought it not worth writing about. But in the light of morning, there is something here worth discussing. Fish raises a neglected issue that I (and a bunch of other people cited at the end of this post) have been trying to foreground: the role of discovery in the humanities. He raises the issue symptomatically, by suppressing it, but the problem is too important to let that slide.

Fish argues, in essence, that digital humanists let the data suggest hypotheses for them instead of framing hypotheses that are then tested against evidence.

“The usual way of doing this is illustrated by my example: I began with a substantive interpretive proposition … and, within the guiding light, indeed searchlight, of that proposition I noticed a pattern that could, I thought be correlated with it. I then elaborated the correlation.

“The direction of my inferences is critical: first the interpretive hypothesis and then the formal pattern, which attains the status of noticeability only because an interpretation already in place is picking it out.

“The direction is the reverse in the digital humanities: first you run the numbers, and then you see if they prompt an interpretive hypothesis. The method, if it can be called that, is dictated by the capability of the tool.”

The underlying element of truth here is that all researchers — humanists and scientists alike — do need to separate the process of discovering a hypothesis from the process of testing it. Otherwise you run into what we unreflecting empiricists call “the problem of data dredging.” If you simply sweep a net through an ocean of data, and frame a conclusion based on whatever you catch, you’re not properly testing anything, because you’re implicitly testing an infinite number of hypotheses that are left unstated — and the significance of any single test is reduced when it’s run as part of a large battery.
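The arithmetic behind that last point is short enough to sketch. The function name `chance_of_false_positive` is my own, and the calculation assumes the tests are independent and that no real effect exists, which is a simplification. Under those assumptions, the chance that at least one of m tests clears a significance threshold alpha is 1 - (1 - alpha)^m.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    looks significant at level `alpha` when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests

# Compare a single test with larger batteries of tests.
for m in (1, 20, 100):
    print(m, round(chance_of_false_positive(m), 3))
```

With the conventional threshold of 0.05, a battery of twenty tests has roughly a 64% chance of turning up at least one "significant" result by chance alone, which is why a single hit from a large sweep means so little.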

That’s true, but it’s also a problem that people who do data mining are quite self-conscious about. It’s why I never stop linking to this xkcd comic about “significance.” And it’s why Matt Wilkens (mistargeted by Fish as an emblem of this interpretive sin) goes through a deliberately iterative process of first framing hypotheses about nineteenth-century geographical imagination and then testing them. (For instance, after noticing that certain states seem especially prominent in 19c American fiction, he tests whether this remains true after you compensate for differences in population size, and then proposes a pair of hypotheses that he suggests will need to be evaluated against additional “test cases.”)
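The population check is a simple normalization. Here is a minimal sketch of that kind of adjustment, using invented figures that are not drawn from Wilkens's study. The place names, counts, and the function name `mentions_per_100k` are all hypothetical.

```python
# Invented figures for illustration; not drawn from Wilkens's study.
mentions = {"State A": 900, "State B": 300}
population = {"State A": 3_000_000, "State B": 500_000}

def mentions_per_100k(mentions, population):
    """Mentions of each place per 100,000 residents."""
    return {place: mentions[place] / population[place] * 100_000
            for place in mentions}

# Rank places by raw mentions, then by mentions relative to population.
raw_rank = sorted(mentions, key=mentions.get, reverse=True)
rates = mentions_per_100k(mentions, population)
adjusted_rank = sorted(rates, key=rates.get, reverse=True)
print(raw_rank, adjusted_rank)
```

In this made-up case the ordering flips once population is taken into account, so an apparent prominence in the raw counts would not survive the adjustment. The test matters because the hypothesis could have failed it.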

William Blake, “Satan, Sin, and Death”

More importantly, Fish profoundly misrepresents his own (traditional) interpretive procedure by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account we normally begin with a hypothesis (which seems to have sprung, like Sin, fully-formed from our head), and test it against a single sentence.

In reality, of course, our “interpretive proposition” is often suggested by the same evidence that confirms it. Or — more commonly — we derive a hypothesis from one example, and then read patiently through dozens of books until we have gathered enough confirming evidence to write a chapter. This process runs into a different interpretive fallacy: if you keep testing a hypothesis until you’ve confirmed it, you’re not testing it at all. And it’s a bit worse than that, because in practice what we do now is go to a full-text search engine and search for terms that would go together if our assumptions were correct. (In the example Fish offers, this might be “bishops” and “presbyters.”) If you find three sentences where those terms coincide, you’ve got more than enough evidence to prop up an argument, using our richly humanistic (cough, anecdotal) conception of evidence. And of course a full-text search engine can find you three examples of just about anything. But we don’t have to worry about this, because search engines are not tools that dictate a method; they are transparent extensions of our interpretive sensibility.

The basic mistake that Fish is making is this: he pretends that humanists have no discovery process at all. For Fish, the interpretive act is always fully contained in an encounter with a single piece of evidence. How your “interpretive proposition” got framed in the first place is a matter of no consequence: some readers are just fortunate to have propositions that turn out to be correct. Fish is not alone in this idealized model of interpretation; it’s widespread among humanists.

Fish is resisting the assistance of digital techniques, not because they would impose scientism on the humanities, but because they would force us to acknowledge that our ideas do after all come from somewhere — whether a search engine or a commonplace book. But as Peter Stallybrass eloquently argued five years ago in PMLA (h/t Mark Sample) the process of discovery has always been collaborative, and has long — at least since early modernity — been embodied in specific textual technologies.

Stallybrass, Peter. “Against Thinking.” PMLA 122.5 (2007): 1580-1587.
Wilkens, Matthew. “Geolocation Extraction and Mapping of Nineteenth-Century U.S. Fiction.” DHCS 2011.
On the process of embodied play that generates ideas, see also Stephen Ramsay’s book Reading Machines (University of Illinois Press, 2011).