Humanists are already doing text mining; we’re just doing it in a theoretically naive way. Every time we search a database, we use complex statistical tools to sort important documents from unimportant ones. We don’t spend a lot of time talking about this part of our methodology, because search engines hide the underlying math, making the sorting process seem transparent.
But search is not a transparent technology: search engines make a wide range of assumptions about similarity, relevance, and importance. If (as I’ve argued elsewhere) search engines’ claim to identify obscure but relevant sources has powerfully shaped contemporary historicism, then our critical practice has come to depend on algorithms that other people write for us, and that we don’t even realize we’re using. Humanists quite properly feel that humanistic research ought to be shaped by our own critical theories, not by the whims of Google. But that can only happen if we understand text mining well enough to build — or at least select — tools more appropriate for our discipline.
This isn’t an abstract problem; existing search technology sits uneasily with our critical theory in several concrete ways. For instance, humanists sometimes criticize text mining by noting that words and concepts don’t line up with each other in a one-to-one fashion. This is quite true, but it’s a critique of humanists’ existing search practices, not of embryonic efforts to improve them. Ordinary forms of keyword search are driven by individual words in a literal-minded way; the point of more sophisticated strategies — like topic modeling — is precisely that they pay attention to looser patterns of association in order to reflect the polysemous character of discourse, where concepts always have multiple names and words often mean several different things.
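Real topic modeling requires a trained model and a sizable corpus, but the underlying contrast can be made concrete with a toy sketch. Everything below is invented for illustration — a four-document corpus and a crude co-occurrence measure, not any actual search engine or topic-modeling library:

```python
# A toy illustration of why literal keyword matching misses synonymy,
# and how looser patterns of association can bridge the gap.
# The corpus is invented; no real search engine works this simply.

docs = [
    "the automobile engine needs oil and fuel",
    "her car engine burned oil on the highway",
    "the court heard the case and issued a ruling",
    "the judge in the court read the ruling aloud",
]

def keyword_search(query, docs):
    """Literal matching: a document counts only if the exact word appears."""
    return [d for d in docs if query in d.split()]

# Searching "automobile" misses the second document, though it is
# plainly about the same subject under a different name.
hits = keyword_search("automobile", docs)

def context_words(word, docs):
    """Collect the words that co-occur with `word` in the same document."""
    ctx = set()
    for d in docs:
        tokens = d.split()
        if word in tokens:
            ctx.update(t for t in tokens if t != word)
    return ctx

# "automobile" and "car" never co-occur, but they share context
# ("engine", "oil"), so an association-based method can treat them
# as related terms rather than as unrelated strings.
overlap = context_words("automobile", docs) & context_words("car", docs)
```

Methods like topic modeling build on this kind of associative evidence at scale, which is what lets them cope with concepts that have several names.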
Perhaps more importantly, humanists have resigned themselves to a hermeneutically naive approach when they accept the dart-throwing game called “choosing search terms.” One of the basic premises of historicism is that other social forms are governed by categories that may not line up with our own; to understand another place or time, a scholar needs to begin by eliciting its own categories. Every time we use a search engine to do historical work we give the lie to this premise by assuming that we already know how experience is organized and labeled in, say, seventeenth-century Spain. That can be a time-consuming assumption, if our first few guesses turn out to be wrong and we have to keep throwing darts. But worse, it can be a misleading assumption, if we accept the first or second set of results and ignore concepts whose names we failed to guess. The point of more sophisticated text-mining techniques — like semantic clustering — is to allow patterns to emerge from historical collections in ways that are (if not absolutely spontaneous) at least a bit less slavishly and minutely dependent on the projection of contemporary assumptions.
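“Semantic clustering” covers a family of methods; the toy sketch below (invented corpus, invented similarity threshold, ordinary co-occurrence vectors and cosine similarity rather than any production algorithm) shows only the basic idea: groupings emerge from shared contexts in the collection itself, rather than from a list of search terms the scholar supplies in advance.

```python
import math
from collections import Counter

# Toy semantic clustering: represent each term by the words it
# co-occurs with, then group terms whose contexts are similar.
# Corpus, term list, and the 0.5 threshold are all invented.

docs = [
    "the ship sailed from the harbor at dawn",
    "the vessel left the harbor under full sail",
    "the merchant sold cloth and spice at market",
    "the trader bought spice and cloth in the market",
]

def cooccurrence_vector(term, docs):
    """Count the words appearing alongside `term` in the same document."""
    vec = Counter()
    for d in docs:
        tokens = d.split()
        if term in tokens:
            vec.update(t for t in tokens if t != term)
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

terms = ["ship", "vessel", "merchant", "trader"]
vectors = {t: cooccurrence_vector(t, docs) for t in terms}

# Greedy single-link grouping: a term joins an existing cluster when
# its context vector is similar enough to any member's; otherwise it
# starts a new cluster. No category labels are supplied in advance.
clusters = []
for t in terms:
    for c in clusters:
        if any(cosine(vectors[t], vectors[u]) > 0.5 for u in c):
            c.append(t)
            break
    else:
        clusters.append([t])
```

On this corpus the method pairs “ship” with “vessel” and “merchant” with “trader” without anyone having guessed those groupings beforehand — which is the sense in which patterns emerge from the collection rather than from contemporary assumptions.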
I don’t want to suggest that we can dispense with search engines; when you already know what you’re looking for, and what it’s called, a naive search strategy may be the shortest path between A and B. But in the humanities you often don’t know precisely what you’re looking for yet, or what it’s called. And in those circumstances, our present search strategies are potentially misleading — although they remain powerful enough to be seductive. In short, I would suggest that humanists are choosing the wrong moment to get nervous about the distorting influence of digital methods. Crude statistical algorithms already shaped our critical practice in the 1990s when we started relying on keyword search; if we want to take back the reins, each humanist is going to need to understand text mining well enough to choose the tools appropriate for his or her own theoretical premises.