For most literary scholars, text mining is going to be an exploratory tool.

Having just returned from a conference of Romanticists, I’m in a mood to reflect a bit about the relationship between text mining and the broader discipline of literary studies. This entry will be longer than my usual blog post, because I think I’ve got an argument to make that demands a substantial literary example. But you can skip over the example to extract the polemical thesis if you like!

At the conference, I argued that literary critics already practice a crude form of text mining, because we lean heavily on keyword search when we’re tracing the history of a topic or discourse. I suggested that information science can now offer us a wider range of tools for mapping archives — tools that are subtler, more consonant with our historicism, and maybe even more literary than keyword search is.

At the same time, I understand the skepticism that many literary critics feel. Proving a literary thesis with statistical analysis is often like cracking a nut with a jackhammer. You can do it: but the results are not necessarily better than you would get by hand.

One obvious solution would be to use text mining in an exploratory way, to map archives and reveal patterns that a critic could then interpret using nuanced close reading. I’m finding that approach valuable in my own critical practice, and I’d like to share an example of how it works. But I also want to reflect about the social forces that stand in the way of this obvious compromise between digital and mainstream humanists — leading both sides to assume that quantitative analysis ought to contribute instead by proving literary theses with increased certainty.

Part of a topic tree based on a generically diverse collection of 2200 18c texts.


I’ll start with an example. If you don’t believe text mining can lead to literary insights, bear with me: this post starts with some goofy-looking graphs, but develops into an actual hypothesis about the Romantic novel based on normal kinds of literary evidence. But if you’re willing to take my word that text-mining can produce literary leads, or simply aren’t interested in Romantic-era fiction, feel free to skip to the end of this (admittedly long!) post for the generalizations about method.

Several months ago, when I used hierarchical clustering to map eighteenth-century diction on this blog, I pointed to a small section of the resulting tree that intriguingly mixed language about feeling with language about time. It turned out that the words in this section of the tree were represented strongly in late-eighteenth-century novels (novels, for instance, by Frances Burney, Sophia Lee, and Ann Radcliffe). Other sections of the tree, associated with poetry or drama, had a more vivid kind of emotive language, and I wondered why novels would combine an emphasis on feeling or exclamation (“felt,” “cried”) with the abstract topic of duration (“moment,” “longer”). It seemed an oddly phenomenological way to think about emotion.

But I also realized that hierarchical clustering is a fairly crude way of mapping conceptual space in an archive. The preferred approach in digital humanities right now is topic modeling, which does elegantly handle problems like polysemy. However, I’m not convinced that existing methods of topic modeling (LDA and so on) are flexible enough to use for exploration. One of their chief advantages is that they don’t require the human user to make judgment calls: they automatically draw boundaries around discrete “topics.” But for exploratory purposes boundaries are not an advantage! In exploring an archive, the goal is not to eliminate ambiguity so that judgment calls are unnecessary: the goal is to reveal intriguing ambiguities, so that the human user can make judgments about them.

If this is our goal, it’s probably better to map diction as an associative web. Fortunately, it was easy to get from the tree to a web, because the original tree had been based on an algorithm that measured the strength of association between any two words in the collection. Using the same algorithm, I created a list of twenty-five words most strongly associated with the branch that had interested me (“instantly,” “cried,” “felt,” moment,” “longer,”) and then used the strengths of association between those words to model the whole list as a force-directed graph. In this graph, words are connected by “springs” that pull them closer together; the darker the line, the stronger the association between the two words, and the more tightly they will be bound together in the graph. (The sizes of words are loosely proportional to their frequency in the collection, but only very loosely.)

A graph like this is not meant to be definitive: it’s a brainstorming tool that helps me explore associations in a particular collection (here, a generically diverse collection of eighteenth-century writing). On the left side, we see a triangle of feminine pronouns (which are strongly represented in the same novels where “felt,” “moment,” and so on are strongly represented) as well as language that defines domestic space (“quitting,” “room”). On the right side of the graph, we see a range of different kinds of emotion. And yet, looking at the graph as a whole, there is a clear emphasis on an intersection of feeling and time — whether the time at issue is prospective (“eagerly,” “hastily,” “waiting”) or retrospective (“recollected,” “regret”).

In particular, there are a lot of words here that emphasize temporal immediacy, either by naming a small division of time (“moment,” “instantly”), or by defining a kind of immediate emotional response (“surprise,” “shocked,” “involuntarily”). I have highlighted some of these words in red; the decision about which words to include in the group was entirely a human judgment call — which means that it is open to the same kind of debate as any other critical judgment.

But the group of words I have highlighted in red — let’s call it a discourse of temporal immediacy — does turn out to have an interesting historical profile. We already know that this discourse was common in late-eighteenth-century novels. But we can learn more about its provenance by restricting the generic scope of the collection (to fiction) and expanding its temporal scope to include the nineteenth as well as eighteenth centuries. Here I’ve graphed the aggregate frequency of this group of words in a collection of 538 works of eighteenth- and nineteenth-century fiction, plotted both as individual works and as a moving average. [The moving average won’t necessarily draw a line through the center of the “cloud,” because these works vary greatly in size. For instance, the collection includes about thirty of Hannah More’s “Cheap Repository Tracts,” which are individually quite short, and don’t affect the moving average more than a couple of average-sized novels would, although they create an impressive stack of little circles in the 1790s.]

The shape of the curve here suggests that we’re looking at a discourse that increased steadily in prominence through the eighteenth century and peaked (in fiction) around the year 1800, before sinking back to a level that was still roughly twice its early-eighteenth-century frequency.

Why might this have happened? It’s always a good idea to start by testing the most boring hypothesis — so a first guess might be that words like “moment” and “instantly” were merely displacing some set of close synonyms. But in fact most of the imaginable synonyms for this set of words correlate very closely with them. (This is true, for instance, of words like “sudden,” “abruptly,” and “alarm.”)

Another way to understand what’s going on would be to look at the works where this discourse was most prominent. We might start by focusing on the peak between 1780 and 1820. In this period, the works of fiction where language of temporal immediacy is most prominent include

    Charlotte Dacre, Zofloya (1806)
    Charlotte Lennox, Hermione, or the Orphan Sisters (1791)
    M. G. Lewis, The Monk (1796)
    Ann Radcliffe, A Sicilian Romance (1790), The Castles of Athlin and Dunbayne (1789), and The Romance of the Forest (1792)
    Frances Burney, Cecilia (1782) and The Wanderer (1814)
    Amelia Opie, Adeline Mowbray (1805)
    Sophia Lee, The Recess (1785)

There is a strong emphasis here on the Gothic, but perhaps also, more generally, on women writers. The works of fiction where the same discourse is least prominent would include

    Hannah More, most of her “Cheap Repository Tracts” and Coelebs in Search of a Wife (1809)
    Robert Bage, Hermsprong (1796)
    John Trusler, Life; or the Adventures of William Ramble (1793)
    Maria Edgeworth, Castle Rackrent (1800)
    Arnaud Berquin, The Children’s Friend (1788)
    Isaac Disraeli, Vaurien; or, Sketches of the Times (1797)

Many of these works are deliberately old-fashioned in their approach to narrative form: they are moral parables, or stories for children, or first-person retrospective narratives (like Rackrent), or are told by Fieldingesque narrators who feel free to comment and summarize extensively (as in the works by Disraeli and Trusler).

After looking closely at the way the language of temporal immediacy is used in Frances Burney, Cecilia (1782), and Sophia Lee, The Recess (1785), it seems to me that it had both a formal and an affective purpose.

Formally, it foregrounded a newly sharp kind of temporal framing. If we believe Ian Watt, part of the point of the novel form is to emulate the immediacy of first-hand experience — a purpose that can be slightly at odds with the retrospective character of narrative. Eighteenth-century novelists fought the distancing effect of retrospection in a lot of ways: epistolary narrative, discovered journals and so on are ways of bringing the narrative voice as close as possible to the moment of experience. But those tricks have limits; at some point, if your heroine keeps running off to write breathless letters between every incident, Henry Fielding is going to parody you.

By the late eighteenth century it seems to me novelists were starting to work out ways of combining temporal immediacy with ordinary retrospective narration. Maybe you can’t literally have your narrator describe events as they’re taking place, but you can describe events in a way that highlights their temporal immediacy. This is one of the things that makes Frances Burney read more like a nineteenth-century novelist than like Defoe; she creates a tight temporal frame for each event, and keeps reminding her readers about the tightness of the frame. So, a new paragraph will begin “A few moments after he was gone …” or “At that moment Sir Robert himself burst into the Room …” or “Cecilia protested she would go instantly to Mr Briggs,” to choose a few examples from a single chapter of Cecilia (my italics, 363-71). We might describe this vaguely as a way of heightening suspense — but there are of course many different ways to produce suspense in fiction. Narratology comes closer to the question at issue when it talks about “pacing,” but unless someone has already coined a better term, I think I would prefer to describe this late-18c innovation as a kind of “temporal framing,” because the point is not just that Burney uses “scene” rather than “summary” to make discourse time approximate story time — but that she explicitly divides each “scene” into a succession of discrete moments.

There is a lot more that could be said about this aspect of narrative form. For one thing, in the Romantic era it seems related to a particular way of thinking about emotion — a strategy that heightens emotional intensity by describing experience as if it were divided into a series of instananeous impressions. E.g, “In the cruelest anxiety and trepidation, Cecilia then counted every moment till Delvile came …” (Cecilia, 613). Characters in Gothic fiction are “every moment expecting” some start, shock, or astonishment. “The impression of the moment” is a favorite phrase for both Burney and Sophia Lee. On one page of The Recess, a character “resign[s] himself to the impression of the moment,” although he is surrounded by a “scene, which every following moment threatened to make fatal” (188, my italics).

In short, fairly simple tools for mapping associations between words can turn up clues that point to significant formal, as well as thematic, patterns. Maybe I’m wrong about the historical significance of those patterns, but I’m pretty sure they’re worth arguing about in any case, and I would never have stumbled on them without text mining.

On the other hand, when I develop these clues into a published article, the final argument is likely to be based largely on narratology and on close readings of individual texts, supplemented perhaps by a few simple graphs of the kind I’ve provided above. I suppose I could master cutting-edge natural language processing, in order to build a fabulous tool that would actually measure narrative pace, and the division of scenes into incidents. That would be fun, because I love coding, and it would be impressive, since it would prove that digital techniques can produce literary evidence. But the thing is, I already have an open-source application that can measure those aspects of narrative form, and it runs on inexpensive hardware that requires only water, glucose, and caffeine.

The methodological point I want to make here is that relatively simple forms of text mining, based on word counts, may turn out to be the techniques that are in practice most useful for literary critics. Moreover, if I can speak frankly: what makes this fact hard for us to acknowledge is not technophilia per se, but the nature of the social division between digital humanists and mainstream humanists. Literary critics who want to dismiss text mining are fond of saying “when you get right down to it, it’s just counting words.” (At moments like this we seem to forget everything 20c literary theorists ever learned from linguistics, and go back to treating language as a medium that ideally, ought to be immaterial and transparent. Surely a crudely verbal approach — founded on lumpy, ambiguous words — can never tell us anything about the conceptual subtleties of form and theme!) Stung by that critique, digital humanists often feel we have to prove that our tools can directly characterize familiar literary categories, by doing complex analyses of syntax, form, and sentiment.

I don’t want to rule out those approaches; I’m not interested in playing the game “Computers can never do X.” They probably can do X. But we’re already carrying around blobs of wetware that are pretty good at understanding syntax and sentiment. Wetware is, on the other hand, terrible at counting several hundred thousand words in order to detect statistical clues. And clues matter. So I really want to urge humanists of all stripes to stop imagining that text mining has to prove its worth by proving literary theses.

That should not be our goal. Full-text search engines don’t perform literary analysis at all. Nor do they prove anything. But literary scholars find them indispensable: in fact, I would argue that search engines are at least partly responsible for the historicist turn in recent decades. If we take the same technology used in those engines (a term-document matrix plus vector space math), and just turn the matrix on its side so that it measures the strength of association between terms rather than documents, we will have a new tool that is equally valuable for literary historians. It won’t prove any thesis by itself, but it can uncover a whole new range of literary questions — and that, it seems to me, ought to be the goal of text mining.

References
Frances Burney, Cecilia; or, Memoirs of an Heiress, ed. Peter Sabor and Margaret Ann Doody (Oxford: OUP, 1999).
Sophia Lee, The Recess; or, a Tale of Other Times, ed. April Alliston (Lexington: UP of Kentucky, 2000).

[Postscript: This post, originally titled “How to make text mining serve literary history,” is a version of a talk I gave at NASSR 2011 in Park City, Utah, sharpened by the discussion that took place afterward. I’d like to thank the organizers of the conference (Andrew Franta and Nicholas Mason) as well as my co-panelists (Mark Algee-Hewitt and Mark Schoenfield) and everyone in the audience. The original slides are here; sometimes PowerPoint can be clearer than prose.

I don’t mean to deny, by the way, that the simple tools I’m using could be refined in many ways — e.g., they could include collocations. What I’m saying is I that don’t think we need to wait for technical refinements. Our text-mining tools are already sophisticated enough to produce valuable leads, and even after we make them more sophisticated, it will remain true that at some point in the critical process we have to get off the bicycle and walk.]