Genre, gender, and point of view.

A paper for the IEEE “big humanities” workshop, written in collaboration with Michael L. Black, Loretta Auvil, and Boris Capitanu, is available on arXiv now as a preprint.

The Institute of Electrical and Electronics Engineers is an odd venue for literary history, and our paper ends up touching so many disciplinary bases that it may be distracting.* So I thought I’d pull out four issues of interest to humanists and discuss them briefly here; I’m also taking the occasion to add a little information about gender that we uncovered too late to include in the paper itself.

1) The overall point about genre. Our title, “Mapping Mutable Genres in Structurally Complex Volumes,” may sound like the sort of impossible task heroines are assigned in fairy tales. But the paper argues that the blurry mutability of genres is actually a strong argument for a digital approach to their history. If we could start from some consensus list of categories, it would be easy to crowdsource the history of genre: we’d each take a list of definitions and fan out through the archive. But centuries of debate haven’t yet produced stable definitions of genre. In that context, the advantage of algorithmic mapping is that it can be comprehensive and provisional at the same time. If you change your mind about underlying categories, you can just choose a different set of training examples and hit “run” again. In fact we may never need to reach a consensus about definitions in order to have an interesting conversation about the macroscopic history of genre.

2) A workset of 32,209 volumes of English-language fiction. On the other hand, certain broad categories aren’t going to be terribly controversial. We can probably agree about volumes — and eventually specific page ranges — that contain (for instance) prose fiction and nonfiction, narrative and lyric poetry, and drama in verse, or prose, or some mixture of the two. (Not to mention interesting genres like “publishers’ ads at the back of the volume.”) As a first pass at this problem, we extract a workset of 32,209 volumes containing prose fiction from a collection of 469,200 eighteenth- and nineteenth-century volumes in HathiTrust Digital Library. The metadata for this workset is publicly available from Illinois’ institutional repository. More substantial page-level worksets will soon be produced and archived at HathiTrust Research Center.

3) The declining prevalence of first-person narration. Once we’ve identified this fiction workset, we switch gears to consider point of view — frankly, because it’s a temptingly easy problem with clear literary significance. Though the fiction workset we’re using is defined more narrowly than it was last February, we confirm the result I glimpsed at that point, which is that the prevalence of first-person point of view declines significantly toward the end of the eighteenth century and then remains largely stable for the nineteenth.

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 32,209 volumes of fiction extracted from HathiTrust Digital Library. Points are mean probabilities for five-year spans of time; a trend line with standard errors has been plotted with loess smoothing.

We can also confirm that result in a way I’m finding increasingly useful, which is to test it in a collection of a completely different sort. The HathiTrust collection includes reprints, which means that popular works have more weight in the collection than a novel printed only once. It also means that many volumes carry a date much later than their first date of publication. In some ways this gives a more accurate picture of print culture (an approximation to “what everyone read,” to borrow Scott Weingart’s phrase), but one could also argue for a different kind of representativeness, where each volume would be included only once, in a record dated to its first publication (an attempt to represent “what everyone wrote”).

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 774 volumes of fiction selected by multiple hands from multiple sources. Plotted in 20-year bins because n is small here. Works are weighted by the number of words they contain.

Fortunately, Jordan Sellers and I produced a collection like that a few years ago, and we can run the same point-of-view classifier on this very different set of 774 fiction volumes (metadata available), selected by multiple hands from multiple sources (including TCP-ECCO, the Brown Women Writers Project, and the Internet Archive). Doing that reveals broadly the same trend line we saw in the HathiTrust collection. No collection can be absolutely representative (for one thing, because we don’t agree on what we ought to be representing). But discovering parallel results in collections that were constructed very differently does give me some confidence that we’re looking at a real trend.
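
For readers curious about the mechanics, the word-count weighting and time-binning described in the caption above can be sketched in a few lines of Python. The data frame below is invented for illustration; it is not the actual corpus.

```python
import pandas as pd

# Hypothetical records: one row per volume, with the classifier's
# probability of first-person narration and the volume's length in words.
volumes = pd.DataFrame({
    "year":      [1705, 1712, 1748, 1763, 1781, 1799, 1824, 1856],
    "p_first":   [0.9, 0.7, 0.6, 0.8, 0.3, 0.5, 0.2, 0.3],
    "wordcount": [60000, 80000, 120000, 90000, 70000, 110000, 95000, 130000],
})

# Assign each volume to a 20-year bin, then take the mean probability
# weighted by word count, so long novels count for more than short ones.
bins = (volumes["year"] // 20) * 20
num = volumes["p_first"].mul(volumes["wordcount"]).groupby(bins).sum()
den = volumes["wordcount"].groupby(bins).sum()
weighted = num / den
print(weighted)
```

The same computation without the `wordcount` weights (a plain `groupby(bins)["p_first"].mean()`) reproduces the unweighted version used for the larger HathiTrust corpus.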

4) Gender and point of view. In the process of classifying works of fiction, we stumbled on interesting thematic patterns associated with point of view. Features associated with first-person perspective include first-person pronouns, obviously, but also number words and words associated with sea travel. Some of this association may be explained by the surprising persistence of a particular two-century-long genre, the Robinsonade. A castaway premise obviously encourages first-person narration, but the colonial impulse in the Robinsonade also seems to have encouraged acquisitive enumeration of the objects (goats, barrels, guns, slaves) its European narrators find on ostensibly deserted islands. Thus all the number words. (But this association of first-person perspective with colonial settings and acquisitive enumeration may well extend beyond the boundaries of the Robinsonade to other genres of adventure fiction.)

Third-person perspective, on the other hand, is durably associated with words for domestic relationships (husband, lover, marriage). We’re still trying to understand these associations; they could be consequences of a preference for third-person perspective in, say, courtship fiction. But third-person pronouns correlate particularly strongly with words for feminine roles (girl, daughter, woman) — which suggests that there might also be a more specifically gendered dimension to this question.

Since transmitting our paper to the IEEE I’ve had a chance to investigate this hypothesis in the smaller of the two collections we used for that paper — 774 works of fiction between 1700 and 1899: 521 by men, 249 by women, and four not characterized by gender. (Mike Black and Jordan Sellers recorded this gender data by hand.) In this collection, it does appear that male writers choose first-person perspective significantly more than women do. The gender gap persists across the whole timespan, although it might be fading toward the end of the nineteenth century.

Proportion of works of fiction by men and women in first person. Based on the same set of 774 volumes described above. (This figure counts strictly by the number of works rather than weighting works by the number of words they contain.)

Over the whole timespan, women use first person in roughly 23% of their works, and men use it in roughly 35% of their works.** That's not a huge difference, but in relative terms it's substantial: men use first person about 52% more often than women. The Bayesian mafia have made me wary of p-values, but if you still care: a chi-squared test on the 2×2 contingency table of gender and point of view gives p < 0.001. (Attentive readers may already be wondering whether the decline of first person might be partly explained by an increase in the proportion of women writers. But actually, in this collection, works by women have a distribution that skews slightly earlier than that of works by men.)
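
For anyone who wants to check the arithmetic, the test is easy to run with scipy. The cell counts below are reconstructed from the rounded percentages above, so treat them as approximate.

```python
from scipy.stats import chi2_contingency

# Approximate 2x2 contingency table, reconstructed from the rounded
# percentages: rows are (men, women), columns are (first, third person).
table = [[182, 339],   # 521 works by men, ~35% in first person
         [57, 192]]    # 249 works by women, ~23% in first person

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.5f}")
```

Note that scipy applies Yates' continuity correction by default for 2×2 tables; with counts this size the conclusion is the same either way.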

These are very preliminary results. 774 volumes is a small set when you could test 32,209. At the recent HTRC Uncamp, Stacy Kowalczyk described a method for gender identification in the larger HathiTrust corpus, which we will be eager to borrow once it’s published. Also, the mere presence of an association between gender and point of view doesn’t answer any of the questions literary critics will really want to pose about this phenomenon — like, why is point of view associated with gender? Is this actually a direct consequence of gender, or is it an indirect consequence of some other variable like genre? Does this gendering of narrative perspective really fade toward the end of the nineteenth century? I don’t pretend to have answered any of those questions; all I’m doing here is flagging the existence of an interesting open question that will deserve further inquiry.

— — — — —

*Other papers for the panel are beginning to appear online. Here’s “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,” by David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon.

** We don’t actually represent point of view as a binary choice between first person or third person; the classifier reports probabilities as a continuous range between 0 and 1. But for purposes of this blog post I’ve simplified by dividing the works into two sets at the 0.5 mark. On this point, and for many other details of quantitative methodology, you’ll want to consult the paper itself.

A new approach to the history of character?

In Macroanalysis, Matt Jockers points out that computational stylistics has found it hard to grapple with “the aspects of writing that readers care most deeply about, namely plot, character, and theme” (118). He then proceeds to use topic modeling to pretty thoroughly anatomize theme in the nineteenth-century novel. One down, I guess, two to go!

But plot and character are probably harder than theme; it’s not yet clear how we would trace those patterns in thousands of volumes. So I think it may be worth flagging a very promising article by David Bamman, Brendan O’Connor, and Noah A. Smith. Computer scientists don’t often develop a new methodology that could seriously enrich criticism of literature and film. But this one deserves a look. (Hat tip to Lynn Cherny, by the way, for this lead.)

The central insight in the article is that character can be modeled grammatically. If you can use natural language processing to parse sentences, you should be able to identify what’s being said about a given character. The authors cleverly sort “what’s being said” into three questions: what does the character do, what do they suffer or undergo, and what qualities are attributed to them? The authors accordingly model character types (or “personas”) as a set of three distributions over these different domains. For instance, the ZOMBIE persona might do a lot of “eating” and “killing,” get “killed” in turn, and find himself described as “dead.”
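
To make the three-distribution idea concrete, here is a toy sketch of a persona as three categorical distributions. The vocabularies and probabilities are invented for illustration, not drawn from the paper.

```python
import random

random.seed(42)

# A persona is three distributions: over things the character does
# (agent verbs), things done to it (patient verbs), and its attributes.
zombie = {
    "agent":     {"eat": 0.5, "kill": 0.3, "shamble": 0.2},
    "patient":   {"killed": 0.6, "shot": 0.3, "burned": 0.1},
    "attribute": {"dead": 0.7, "rotting": 0.2, "slow": 0.1},
}

def sample(dist, k=3):
    """Draw k words from a categorical distribution."""
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=k)

# "Generate" the words a plot summary might use for one zombie character.
for role in ("agent", "patient", "attribute"):
    print(role, sample(zombie[role]))
```

The paper's actual model, of course, runs this machinery in reverse: given the observed words, it infers which latent personas best explain them.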

The authors try to identify character types of this kind in a collection of 42,306 movie plot summaries extracted from Wikipedia. The model they use is a generative one, which entails assumptions that literary critics would call “structuralist.” Movies in a given genre have a tendency to rely on certain recurring character types. Those character types in turn “generate” the specific characters in a given story, which in turn generate the actions and attributes described in the plot summary.

Using this model, they reason inward from both ends of the process. On the one hand, we know the genres that particular movies belong to. On the other hand, we can see that certain actions and attributes tend to recur together in plot summaries. Can we infer the missing link in this process — the latent character types (“personas”) that mediate the connection from genre to action?

It’s a very thoughtful model, both mathematically and critically. Does it work? Different disciplines will judge success in different ways. Computer scientists tend to want to validate a model against some kind of ground truth; in this case they test it against character patterns described by fans on TV Tropes. Film critics may be less interested in validating the model than in seeing whether it tells them anything new about character. And I think the model may actually have some new things to reveal; among other things, it suggests that the vocabulary used to describe character is strongly coded by genre. In certain genres, characters “flirt”; in others, they “switch” or “are switched.” In some genres, characters merely “defeat” each other; in other genres, they “decapitate” or “are decapitated”!

Since an association with genre is built into the generative assumptions that define the article’s model of character, this might be a predetermined result. But it also raises a hugely interesting question, and there’s lots of room for experimentation here. If the authors’ model of character is too structuralist for your taste, you’re free to sketch a different one and give it a try! Or, if you’re skeptical about our ability to fully “model” character, you could refuse to frame a generative model at all, and just use clustering algorithms in an ad hoc exploratory way to find clues.

Critics will probably also cavil about the dataset (which the authors have generously made available). Do Wikipedia plot summaries tell us about recurring character patterns in film, or do they tell us about the character patterns that are most readily recognized by editors of Wikipedia?

But I think it would be a mistake to cavil. When computer scientists hand you a new tool, the question to ask is not, “Have they used it yet to write innovative criticism?” The question to ask is, “Could we use this?” And clearly, we could.

The approach embodied in this article could be enormously valuable: it could help distant reading move beyond broad stylistic questions and start to grapple with the explicit social content of fiction (and for that matter, nonfiction, which may also rely on implicit schemas of character, as the authors shrewdly point out). Ideally, we would not only map the assumptions about character that typify a given period, but describe how those patterns have changed across time.

Making that work will not be simple: as always, the real problem is the messiness of the data. Applying this technique to actual fictive texts will be a lot harder than applying it to a plot summary. Character names are often left implicit. Many different voices speak; they’re not all equally reliable. And so on.

But the Wordseer Project at Berkeley has begun to address some of these problems. Also, it’s possible that the solution is to scale up instead of sweating the details of coreference resolution: an error rate of 20 or 30% might not matter very much, if you’re looking at strongly marked patterns in a corpus of 40,000 novels.

In any case, this seems to me an exciting lead, worthy of further exploration.

Postscript: Just to illustrate some of the questions that come up: How gendered are character types? The article by Bamman et al. explicitly models gender as a variable, but the types it ends up identifying are less gender-segregated than I might expect. The heroes and heroines of romantic comedy, for instance, seem to be described in similar ways. Would this also be true in nineteenth-century fiction?

We don’t already understand the broad outlines of literary history.

This post is substantially the same as a talk I delivered at the University of Nebraska on Friday, Feb 8th.

In recent months I’ve had several conversations with colleagues who are friendly to digital methods but wary of claims about novelty that seem overstated. They believe that text mining can add a new level of precision to our accounts of literary history, or add a new twist to an existing debate. They just don’t think it’s plausible that quantification will uncover fundamentally new evidence, or patterns we didn’t previously expect.

If I understand my friends’ skepticism correctly, it’s founded less on a narrow objection to text mining than on a basic premise about the nature of literary study. And where the history of the discipline is concerned, they’re arguably right. In fact, the discipline of literary studies has not usually advanced by uncovering unexpected evidence. As grad students, that’s not what we were taught to aim for. Instead we learned that the discipline moves forward dialectically. You take something that people already believe and “push against” it, or “critique” it, or “complicate” it. You don’t make discoveries in literary study, or if you do they’re likely to be minor — a lost letter from Byron to his tailor. Instead of making discoveries, you make interventions — a telling word.

The broad contours of our discipline are already known, so nothing can grow without displacing something else.

So much flows from this assumption. If we’re not aiming for discovery, if the broad contours of literary history are already known, then methodological conversation can only be a zero-sum game. That’s why, when I say “digital methods don’t have to displace traditional scholarship,” my colleagues nod politely but assume it’s insincere happy talk. They know that in reality, the broad contours of our discipline are already known, and anything within those boundaries can only grow by displacing something else.

These are the assumptions I was also working with until about three years ago. But a couple of years of mucking about in digital archives have convinced me that the broad contours of literary history are not in fact well understood.

For instance, I just taught a course called Introduction to Fiction, and as part of that course I talk about the importance of point of view. You can characterize point of view in a lot of subtle ways, but the initial, basic division is between first-person and third-person perspectives.

Suppose some student had asked the obvious question, “Which point of view is more common? Is fiction mostly written in the first or third person? And how long has it been that way?” Fortunately undergrads don’t ask questions like that, because I couldn’t have answered.

I have a suspicion that first person is now used more often in literary fiction than in novels for a mass market, but if you ask me to defend that — I can’t. If you ask me how long it’s been that way — no clue. I’ve got a Ph.D. in this field, but I don’t know the history of a basic formal device. Now, I’m not totally ignorant. I can say what everyone else says: “Jane Austen perfected free indirect discourse. Henry James. Focalizing character. James Joyce. Stream of consciousness. Etc.” And three years ago that might have seemed enough, because the bigger, simpler question was obviously unanswerable and I wouldn’t have bothered to pose it.

But recently I’ve realized that this question is answerable. We’ve got large digital archives, so we could in principle figure out how the proportions of first- and third-person narration have changed over time.

You might reasonably expect me to answer that question now. If so, you underestimate my commitment to the larger thesis here: that we don’t understand literary history. I will eventually share some new evidence about the history of narration. But first I want to stress that I’m not in a position to fully answer the question I’ve posed. For three reasons:

    1) Our digital collections are incomplete. I’m working with a collection of about 700,000 18th and 19th-century volumes drawn from HathiTrust. That’s a lot. But it’s not everything that was written in the English language, or even everything that was published.

    2) This is work in progress. For instance, I’ve cleaned and organized the non-serial part of the collection (about 470,000 volumes), but I haven’t started on the periodicals yet. Also, at the moment I’m counting volumes rather than titles, so if a book was often reprinted I count it multiple times. (This could be a feature or a bug depending on your goals.)

    3) Most importantly, we can’t answer the question because we don’t fully understand the terms we’re working with. After all, what is “first-person narration?”

The truth is that the first person comes in a lot of different forms. There are cases where the narrator is also the protagonist. That’s pretty straightforward. Then there are epistolary novels. Then there are cases where the narrator is anonymous — and not a participant in the action — but sometimes refers to herself as I. Even Jane Austen’s narrator sometimes says “I.” Henry Fielding’s narrator does it a lot more. Should we simply say this is third-person narration, or should we count it as a move in the direction of first? Then, what are we going to do about books like Bleak House, with its alternating chapters of first and third person? Maybe we call that 50% first person — or do we assign it to a separate category altogether? What about a novel like Dracula, where journals and letters are interspersed with news clippings?

Suppose we tried to crowdsource this problem. We get a big team together and decide to go through half a million volumes, first of all to identify the ones that are fiction, and secondly, if a volume is fiction, to categorize the point of view. Clearly, it’s going to be hard to come to agreement on categories. We might get halfway through the crowdsourcing process, discover a new category, and have to go back to the drawing board.

Notice that I haven’t mentioned computers at all yet. This is not a problem created by computers, because they “only understand binary logic.” It’s a problem created by us. Distant reading is hard, fundamentally, because human beings don’t agree on a shared set of categories. Franco Moretti has a well-known list of genres, for instance, in Graphs, Maps, Trees. But that list doesn’t represent an achieved consensus. Moretti separates the eighteenth-century gothic novel from the late-nineteenth-century “imperial gothic.” But for other critics, those are two parts of the same genre. For yet other critics, the “gothic” isn’t a genre at all; it’s a mode like tragedy or satire, which is why gothic elements can pervade a bunch of different genres.

This is the darkest moment of this post. It may seem that there’s no hope for literary historians. How can we ever know anything if we can’t even agree on the definitions of basic concepts like genre and point of view? But here’s the crucial twist — and the real center of what I want to say. The blurriness of literary categories is exactly why it’s helpful to use computers for distant reading. With an algorithm, we can classify 500,000 volumes provisionally. Try defining point of view one way, and see what you get. If someone else disagrees, change the definition; you can run the algorithm again overnight. You can’t re-run a crowdsourced cataloguing project on 500,000 volumes overnight.

Second, algorithms make it easier to treat categories as plural and continuous. Although Star Trek teaches us otherwise, computers do not start to stammer and emit smoke if you tell them that an object belongs in two different categories at once. Instead of sorting texts into category A or category B, we can assign degrees of membership to multiple categories. As many as we want. So The Moonstone can be 80% similar to a sensation novel and 50% similar to an imperial gothic, and it’s not a problem. Of course critics are still going to disagree about individual cases. And we don’t have to pretend that these estimates are precise characterizations of The Moonstone. The point is that an algorithm can give us a starting point for discussion, by rapidly mapping a large collection in a consistent but flexibly continuous way.
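
One simple way to get plural, continuous memberships is to train an independent binary classifier for each genre, so that a volume's membership probabilities don't have to sum to one. The features and labels below are made up purely to show the mechanics.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features for six volumes (imagine the columns are counts
# of genre-marked words like "ghost", "secret", "inheritance", "telegram").
X = np.array([[8, 1, 0, 0],
              [6, 2, 1, 0],
              [1, 7, 5, 2],
              [0, 6, 6, 3],
              [5, 5, 2, 1],
              [2, 2, 1, 0]])

# Independent binary labels: a volume can be gothic, sensation, both, or neither.
is_gothic    = np.array([1, 1, 0, 0, 1, 0])
is_sensation = np.array([0, 0, 1, 1, 1, 0])

memberships = {}
for name, y in [("gothic", is_gothic), ("sensation", is_sensation)]:
    clf = MultinomialNB().fit(X, y)
    # Probability of membership in this one category, ignoring the others.
    memberships[name] = clf.predict_proba(X)[:, 1]

# Volume 4 was labeled as belonging to both genres; its two membership
# scores are estimated independently and need not sum to 1.
print(memberships["gothic"][4], memberships["sensation"][4])
```

This is the "one-vs-rest" framing: each category is its own yes/no question, which is what lets The Moonstone be strongly similar to two genres at once.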

Then we can ask, Does the gothic often overlap with the sensation novel? What other genres does it overlap with? Even if the boundaries are blurry, and critics disagree about every individual case — even if we don’t have a perfect definition of the term “genre” itself — we’ve now got a map, and we can start talking about the relations between regions of the map.

Can we actually do this? Can we use computers to map things like genre and point of view? Yes, to coin a phrase, we can. The truth is that you can learn a lot about a document just by looking at word frequency. That’s how search engines work, that’s how spam gets filtered out of your e-mail; it’s a well-developed technology. The Stanford Literary Lab suggested a couple of years ago that it would probably work for literary genres as well (see Pamphlet 1), and Matt Jockers has more detailed work forthcoming on genre and diction in Macroanalysis.

There are basically three steps to the process. First, get a training set of a thousand or so examples and tag the categories you want to recognize: poetry or prose, fiction or nonfiction, first- or third-person narration. Then, identify features (usually words) that turn out to provide useful clues about those categories. There are a lot of ways of doing this automatically. Personally, I use a Wilcoxon test to identify words that are consistently common or uncommon in one class relative to others. Finally, train classifiers using those features. I use what’s known as an “ensemble” strategy where you train multiple classifiers and they all contribute to the final result. Each of the classifiers individually uses an algorithm called “naive Bayes,” which I’m not going to explain in detail here; let’s just say that collectively, as a group, they’re a little less “naive” than they are individually — because they’re each relying on slightly different sets of clues.
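
Here is a minimal sketch of those three steps on synthetic data, using scipy's Mann-Whitney (Wilcoxon rank-sum) test for feature selection and an averaged ensemble of naive Bayes classifiers trained on slightly different feature subsets. The real pipeline involves far more features and documents, and differs in detail.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Step 1: a synthetic "training set" of word counts for two classes.
# Class 0 documents overuse feature 0; class 1 documents overuse feature 1.
n, n_features = 200, 20
y = np.repeat([0, 1], n // 2)
X = rng.poisson(2, size=(n, n_features))
X[y == 0, 0] += rng.poisson(6, size=n // 2)   # tell-tale word for class 0
X[y == 1, 1] += rng.poisson(6, size=n // 2)   # tell-tale word for class 1

# Step 2: rank features by a Mann-Whitney test between the two classes,
# keeping words that are consistently common in one class but not the other.
pvals = np.array([mannwhitneyu(X[y == 0, f], X[y == 1, f]).pvalue
                  for f in range(n_features)])
ranked = np.argsort(pvals)            # most class-discriminating features first
selected = ranked[:8]

# Step 3: an ensemble of naive Bayes classifiers, each trained on a
# slightly different subset of the selected features; average their votes.
probs = []
for i in range(4):
    subset = np.delete(selected, i)   # each member drops one feature
    clf = MultinomialNB().fit(X[:, subset], y)
    probs.append(clf.predict_proba(X[:, subset])[:, 1])
p_ensemble = np.mean(probs, axis=0)
accuracy = np.mean((p_ensemble > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Because each member of the ensemble sees a different set of clues, their collective vote is less dependent on any single feature than each classifier is individually.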

Confusion matrix from an ensemble of naive Bayes classifiers. (432 test documents held out from a larger sample of 1356.)

How accurate does this end up being? This confusion matrix gives you a sense. Let me underline that this is work in progress. If I were presenting finished results I would need to run this multiple times and give you an average value. But these are typical results. Here I’ve got a corpus of thirteen hundred nineteenth-century volumes. I train a set of classifiers on two-thirds of the corpus, and then test it by using it to classify the other third of the corpus which it hasn’t yet seen. That’s what I mean by saying 432 documents were “held out.” To make the accuracy calculations simple here, I’ve treated these categories as if they were exclusive, but in the long run, we don’t have to do that: documents can belong to more than one at once.
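
The held-out evaluation can be sketched like this, again with invented data; the three "genres" below are synthetic classes, not the real training corpus.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)

# Synthetic word counts for three genres, 150 documents each: each genre
# overuses a different block of ten words in a 30-word vocabulary.
X = rng.poisson(1, size=(450, 30))
y = np.repeat([0, 1, 2], 150)          # 0=fiction, 1=poetry, 2=nonfiction
for g in range(3):
    X[y == g, g * 10:(g + 1) * 10] += rng.poisson(3, size=(150, 10))

# Hold out a third of the corpus that the classifier never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

pred = MultinomialNB().fit(X_train, y_train).predict(X_test)
cm = confusion_matrix(y_test, pred)    # rows = true genre, cols = predicted
print(cm)
```

Counts on the diagonal are correct classifications; off-diagonal cells show which genres get confused with which.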

These results are pretty good, but that’s partly because this test corpus didn’t have a lot of miscellaneous collected works in it. In reality you see a lot of volumes that are a mosaic of different genres — the collected poems and plays of so-and-so, prefaced by a prose life of the author, with an index at the back. Obviously if you try to classify that volume as a single unit, it’s going to be a muddle. But I think it’s not going to be hard to use genre classification itself to segment volumes, so that you get the introduction, and the plays, and the lyric poetry sorted out as separate documents. I haven’t done that yet, but it’s the next thing on my agenda.

One complication I have already handled is historical change. Following up a hint from Michael Witmore, I’ve found that it’s useful to train different classifiers for different historical periods. Then when you get an uncategorized document, you can have each classifier make a prediction, and weight those predictions based on the date of the document.
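
Sketched schematically, that looks like a weighted average of per-period predictions. The Gaussian kernel below is chosen purely for illustration; the real weighting could take other forms.

```python
import numpy as np

# Suppose we trained one classifier per historical period, and each
# reports a probability that an uncategorized volume is fiction.
period_centers = np.array([1725, 1775, 1825, 1875])
period_predictions = np.array([0.85, 0.80, 0.40, 0.35])  # hypothetical outputs

def weighted_prediction(doc_year, bandwidth=40.0):
    """Weight each period's prediction by its distance from the document's date."""
    w = np.exp(-0.5 * ((doc_year - period_centers) / bandwidth) ** 2)
    return np.sum(w * period_predictions) / np.sum(w)

# A volume dated 1790 listens mostly to the 1775 and 1825 classifiers.
print(round(weighted_prediction(1790), 3))
```

The effect is that a classifier trained on eighteenth-century diction has little say about a volume dated 1880, and vice versa.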

So what have I found? First of all, here’s the absolute number of volumes I was able to identify as fiction in HathiTrust’s collection of eighteenth and nineteenth-century English-language books. Instead of plotting individual years, I’ve plotted five-year segments of the timeline. The increase, of course, is partly just an absolute increase in the number of books published.

But it’s also an increase specifically in fiction. Here I’ve graphed the number of volumes of fiction divided by the total number of volumes in the collection. The proportion of fiction increases in a straightforward linear way, from 1700-1704, when fiction is only about 5% of the collection, to 1895-99, when it’s 25%. People better-versed in book history may already have known that this was a linear trend, but I was a bit surprised. (I should note that I may be slightly underestimating the real numbers before 1750, for reasons explained in the fine print to the earlier graph — basically, it’s hard for the classifier to find examples of a class that is very rare.)

Features consistently more common in first- or third-person narration, ranked by Mann-Whitney-Wilcoxon rho.

What about the question we started with — first-person narration? I approach this the same way I approached genre classification. I trained a classifier on 290 texts that were clearly dominated by first- or third-person narration, and used a Wilcoxon test to select features that are consistently more common in one set or in the other.

Now, it might seem obvious what these features are going to be: obviously, we would expect first-person and third-person pronouns to be the most important signal. But I’m allowing the classifier to include whatever features it actually finds useful. For instance, terms for domestic relationships like “daughter” and “husband” and the relative pronouns “whose” and “whom” are also consistently more common in third-person contexts, and oddly, numbers seem more common in first-person contexts. I don’t know why that is yet; this is work in progress and there’s more exploration to do. But for right now I haven’t second-guessed the classifier; I’ve used the top sixteen features in both lists whether they “make sense” or not.

And this is what I get. The classifier predicts each volume’s probability of belonging to the class “first person.” That can be anywhere between 0 and 1, and it’s often in the middle (Bleak House, for instance, is 0.54). I’ve averaged those values for each five-year interval. I’ve also dropped the first twenty years of the eighteenth century, because the sample size was so low there that I’m not confident it’s meaningful.

Now, there’s a lot more variation in the eighteenth century than in the nineteenth century, partly because the sample size is smaller. But even with that variation it’s clear that there’s significantly more first-person narration in the eighteenth century. About half of eighteenth-century fiction is first-person, and in the nineteenth century that drops down to about a quarter. That’s not something I anticipated. I expected that there might be a gradual decline in the amount of first-person narration, but I didn’t expect this clear and relatively sudden moment of transition. Obviously when you see something you don’t expect, the first question you ask is, could something be wrong with the data? But I can’t see a source of error here. I’ve cleaned up most of the predictable OCR errors in the corpus, and there aren’t more medial s’s in one list than in the other anyway.

And perhaps this picture is after all consistent with our expectations. Eleanor Courtemanche points out that the timing of the shift to third person is consistent with Ian Watt’s account of the development of omniscience (as exemplified, for instance, in Austen). In a quick twitter poll I carried out before announcing the result, Jonathan Hope did predict that there would be a shift from first-person to third-person dominance, though he expected it to be more gradual. Amanda French may have gotten the story up to 1810 exactly right, although she expected first-person to recover in the nineteenth century. I expected a gradual decline of first-person to around 1810, and then a gradual recovery — so I seem to have been completely wrong.

The ratio between raw counts of first- and third-person pronouns in fiction.


Much more could be said about this result. You could decide that I’m wrong to let my classifier use things like numbers and relative pronouns as clues about point of view; we could restrict it just to counting personal pronouns. (That won’t change the result very significantly, as you can see in the illustration on the right — which also, incidentally, shows what happens in those first twenty years of the eighteenth century.) But we could refine the method in many other ways. We could exclude pronouns in direct discourse. We could break out epistolary narratives as a separate category.

All of these things should be tried. I’m explicitly not claiming to have solved this problem yet. Remember, the thesis of this talk is that we don’t understand literary history. In fact, I think the point of posing these questions on a large scale is partly to discover how slippery they are. I realize that to many people that will seem like a reason not to project literary categories onto a macroscopic scale. It’s going to be a mess, so — just don’t go there. But I think the mess is the reason to go there. The point is not that computers are going to give us perfect knowledge, but that we’ll discover how much we don’t know.

For instance, I haven’t figured out yet why numbers are common in first-person narrative, but I suspect it might be because there’s a persistent affinity with travel literature. As we follow up leads like that we may discover that we don’t understand point of view itself as well as we assume.

It’s this kind of complexity that will ultimately make classification interesting. It’s not just about sorting things into categories, but about identifying the places where a category breaks down or has changed over time. I would draw an analogy here to a paper on “Gender in Twitter” recently published by a group of linguists. They used machine learning to show that there are not two but many styles of gender performance on Twitter. I think we’ll discover something similar as we explore categories like point of view and genre. We may start out trying to recognize known categories, like first-person narration. But when you sort a large collection into categories, the collection eventually pushes back on your categories as much as the categories illuminate the collection.

Acknowledgments: This research was supported by the Andrew W. Mellon Foundation through “Expanding SEASR Services” and “The Uses of Scale in Literary Study.” Loretta Auvil, Mike Black, and Boris Capitanu helped develop resources for normalizing 18/19c OCR, many of which are publicly available. Jordan Sellers developed the initial training corpus of 19c documents categorized by genre.

A touching detail produced by LDA …

I’m getting ahead of myself with this post, because I don’t have time to explain everything I did to produce this. But it was just too striking not to share.

Basically, I’m experimenting with Latent Dirichlet Allocation, and I’m impressed. So first of all, thanks to Matt Jockers, Travis Brown, Neil Fraistat, and everyone else who tried to convince me that Bayesian methods are better. I’ve got to admit it. They are.
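The workflow itself is easy to sketch. The post doesn’t say which LDA implementation was used, so this is just a minimal illustration with scikit-learn on a handful of invented documents: build a document-term matrix, fit a topic model, and read off each document’s topic proportions.

```python
# A minimal sketch of the LDA workflow (the post doesn't say which
# implementation was actually used; documents here are invented).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "mind feelings heart felt sorrow",
    "heart felt mind feelings grief",
    "ship voyage captain sea island",
    "sea captain ship voyage storm",
]

# Build the document-term matrix, then fit a two-topic model
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Topic proportions per document: one row per document, rows sum to 1.
# Plotting a single topic's proportion against publication year gives
# graphs like the one below.
doc_topics = lda.transform(dtm)
print(doc_topics.shape)  # (4, 2)
```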

But anyway, in a class I’m teaching we’re using LDA on a generically diverse collection of 1,853 volumes between 1751 and 1903. The collection includes fiction, poetry, drama, and a limited amount of nonfiction (just biography). We’re stumbling on a lot of fascinating things, but this was slightly moving. Here’s the graph for one particular topic.

Image of a topic.
The circles and X’s are individual volumes. Blue is fiction, green is drama, pinkish purple is poetry, and black is biography. Only the volumes where this topic turned out to be prominent are plotted, because if you plot all 1,853 it’s just a blurry line at the bottom of the image. The gray line is an aggregate frequency curve, which is not related in any very intelligible way to the y-axis. (Work in progress …) As you can see, this topic is mostly prominent in fiction around the year 1800. Here are the top 50 words in the topic:

But here’s what I find slightly moving. The x’s at the top of the graph are the 10 works in the collection where the topic was most prominent. They include, in order: Mary Wollstonecraft Shelley, Frankenstein; Mary Wollstonecraft, Mary; William Godwin, St. Leon; Mary Wollstonecraft Shelley, Lodore; William Godwin, Fleetwood; William Godwin, Mandeville; and Mary Wollstonecraft Shelley, Falkner.

In short, this topic is exemplified by a family! Mary Hays does intrude into the family circle with Memoirs of Emma Courtney, but otherwise, it’s Mary Wollstonecraft, William Godwin, and their daughter.

Other critics have of course noticed that M. W. Shelley writes “Godwinian novels.” And if you go further down the list of works, the picture becomes less familial (Helen Maria Williams and Thomas Holcroft butt in, as well as P. B. Shelley). Plus, there’s another topic in the model (“myself these should situation”) that links William Godwin more closely to Charles Brockden Brown than it does to his wife or daughter. And LDA isn’t graven on stone; every time you run topic modeling you’re going to get something slightly different. But still, this is kind of a cool one. “Mind feelings heart felt” indeed.

MLA talk: just the thesis.

Giving a talk this morning at the MLA. There are two main arguments:

1) The first one will be familiar if you’ve read my blog. I suggest that the boundary between “text mining” and conventional literary research is far fuzzier than people realize. There appears to be a boundary only because literary scholars are pretty unreflective about the way we’re currently using full-text search. I’m going to press this point in detail, because it’s not just a metaphor: to produce a simple but useful topic-modeling algorithm, all you have to do is take a search engine and run it backwards.

2) The second argument is newer; I don’t think I’ve blogged about it yet. I’m going to present topic modeling as a useful bridge between “distant” and “close” reading. I’ve found that I often learn most about a genre by modeling it as part of a larger collection that includes many other genres. In that context, a topic-modeling algorithm can highlight peculiar convergences of themes that characterize the genre relative to its contemporary backdrop.

A slide from the talk, where a simple topic-modeling algorithm has been used to produce a dendrogram that offers a clue about the temporal framing of narration in late-18c novels.

This is distant reading, in the sense that it requires a large collection. But it’s also close reading, in the sense that it’s designed to reveal subtle formal principles that shape individual works, and that might otherwise elude us.

Although the emphasis is different, a lot of the examples I use are recycled from a talk I gave in August, described here.

For most literary scholars, text mining is going to be an exploratory tool.

Having just returned from a conference of Romanticists, I’m in a mood to reflect a bit about the relationship between text mining and the broader discipline of literary studies. This entry will be longer than my usual blog post, because I think I’ve got an argument to make that demands a substantial literary example. But you can skip over the example to extract the polemical thesis if you like!

At the conference, I argued that literary critics already practice a crude form of text mining, because we lean heavily on keyword search when we’re tracing the history of a topic or discourse. I suggested that information science can now offer us a wider range of tools for mapping archives — tools that are subtler, more consonant with our historicism, and maybe even more literary than keyword search is.

At the same time, I understand the skepticism that many literary critics feel. Proving a literary thesis with statistical analysis is often like cracking a nut with a jackhammer. You can do it: but the results are not necessarily better than you would get by hand.

One obvious solution would be to use text mining in an exploratory way, to map archives and reveal patterns that a critic could then interpret using nuanced close reading. I’m finding that approach valuable in my own critical practice, and I’d like to share an example of how it works. But I also want to reflect about the social forces that stand in the way of this obvious compromise between digital and mainstream humanists — leading both sides to assume that quantitative analysis ought to contribute instead by proving literary theses with increased certainty.

Part of a topic tree based on a generically diverse collection of 2200 18c texts.

I’ll start with an example. If you don’t believe text mining can lead to literary insights, bear with me: this post starts with some goofy-looking graphs, but develops into an actual hypothesis about the Romantic novel based on normal kinds of literary evidence. But if you’re willing to take my word that text-mining can produce literary leads, or simply aren’t interested in Romantic-era fiction, feel free to skip to the end of this (admittedly long!) post for the generalizations about method.

Several months ago, when I used hierarchical clustering to map eighteenth-century diction on this blog, I pointed to a small section of the resulting tree that intriguingly mixed language about feeling with language about time. It turned out that the words in this section of the tree were represented strongly in late-eighteenth-century novels (novels, for instance, by Frances Burney, Sophia Lee, and Ann Radcliffe). Other sections of the tree, associated with poetry or drama, had a more vivid kind of emotive language, and I wondered why novels would combine an emphasis on feeling or exclamation (“felt,” “cried”) with the abstract topic of duration (“moment,” “longer”). It seemed an oddly phenomenological way to think about emotion.

But I also realized that hierarchical clustering is a fairly crude way of mapping conceptual space in an archive. The preferred approach in digital humanities right now is topic modeling, which does elegantly handle problems like polysemy. However, I’m not convinced that existing methods of topic modeling (LDA and so on) are flexible enough to use for exploration. One of their chief advantages is that they don’t require the human user to make judgment calls: they automatically draw boundaries around discrete “topics.” But for exploratory purposes boundaries are not an advantage! In exploring an archive, the goal is not to eliminate ambiguity so that judgment calls are unnecessary: the goal is to reveal intriguing ambiguities, so that the human user can make judgments about them.

If this is our goal, it’s probably better to map diction as an associative web. Fortunately, it was easy to get from the tree to a web, because the original tree had been based on an algorithm that measured the strength of association between any two words in the collection. Using the same algorithm, I created a list of twenty-five words most strongly associated with the branch that had interested me (“instantly,” “cried,” “felt,” “moment,” “longer”), and then used the strengths of association between those words to model the whole list as a force-directed graph. In this graph, words are connected by “springs” that pull them closer together; the darker the line, the stronger the association between the two words, and the more tightly they will be bound together in the graph. (The sizes of words are loosely proportional to their frequency in the collection, but only very loosely.)
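The post doesn’t say what software drew the graph, but the force-directed idea itself fits in a few lines. In this sketch the words, the association strengths, and the update rule are all invented for illustration: each iteration pulls strongly associated words together with a spring force and pushes every pair weakly apart, so strongly associated pairs settle closer together than weakly associated ones.

```python
# A bare-bones sketch of a force-directed layout: springs pull
# strongly associated words together; a uniform repulsion keeps the
# layout spread out. Words and weights are invented for illustration.
import math, random

words = ["felt", "moment", "instantly", "cried", "room"]
# Hypothetical association strengths (0-1) between word pairs
assoc = {
    ("felt", "moment"): 0.9,
    ("moment", "instantly"): 0.8,
    ("felt", "cried"): 0.7,
    ("cried", "room"): 0.3,
}

random.seed(0)
pos = {w: [random.random(), random.random()] for w in words}

for _ in range(200):
    for w in words:
        fx = fy = 0.0
        for v in words:
            if v == w:
                continue
            dx = pos[v][0] - pos[w][0]
            dy = pos[v][1] - pos[w][1]
            d = math.hypot(dx, dy) or 1e-9
            k = assoc.get((w, v)) or assoc.get((v, w)) or 0.0
            # spring attraction proportional to association strength,
            # minus a weak repulsion between every pair of words
            f = k * d - 0.02 / d
            fx += f * dx / d
            fy += f * dy / d
        pos[w][0] += 0.05 * fx
        pos[w][1] += 0.05 * fy

def dist(a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# A strongly associated pair ends up closer than an unassociated one
print(dist("felt", "moment") < dist("felt", "room"))
```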

A graph like this is not meant to be definitive: it’s a brainstorming tool that helps me explore associations in a particular collection (here, a generically diverse collection of eighteenth-century writing). On the left side, we see a triangle of feminine pronouns (which are strongly represented in the same novels where “felt,” “moment,” and so on are strongly represented) as well as language that defines domestic space (“quitting,” “room”). On the right side of the graph, we see a range of different kinds of emotion. And yet, looking at the graph as a whole, there is a clear emphasis on an intersection of feeling and time — whether the time at issue is prospective (“eagerly,” “hastily,” “waiting”) or retrospective (“recollected,” “regret”).

In particular, there are a lot of words here that emphasize temporal immediacy, either by naming a small division of time (“moment,” “instantly”), or by defining a kind of immediate emotional response (“surprise,” “shocked,” “involuntarily”). I have highlighted some of these words in red; the decision about which words to include in the group was entirely a human judgment call — which means that it is open to the same kind of debate as any other critical judgment.

But the group of words I have highlighted in red — let’s call it a discourse of temporal immediacy — does turn out to have an interesting historical profile. We already know that this discourse was common in late-eighteenth-century novels. But we can learn more about its provenance by restricting the generic scope of the collection (to fiction) and expanding its temporal scope to include the nineteenth as well as eighteenth centuries. Here I’ve graphed the aggregate frequency of this group of words in a collection of 538 works of eighteenth- and nineteenth-century fiction, plotted both as individual works and as a moving average. [The moving average won’t necessarily draw a line through the center of the “cloud,” because these works vary greatly in size. For instance, the collection includes about thirty of Hannah More’s “Cheap Repository Tracts,” which are individually quite short, and don’t affect the moving average more than a couple of average-sized novels would, although they create an impressive stack of little circles in the 1790s.]
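The bracketed caveat is worth making concrete: the moving average here is weighted by the length of each work, so a stack of short tracts barely budges the curve even when each tract’s individual frequency is high. A sketch with invented numbers:

```python
# A word-weighted moving average, sketched with invented numbers:
# (year, total words in the work, occurrences of the tracked word group)
works = [
    (1788, 120_000, 480),
    (1790,   8_000,  60),   # a short tract, individually high-frequency
    (1790,   8_000,  55),
    (1791, 150_000, 700),
    (1793, 140_000, 650),   # outside the 1790 window below
]

def moving_average(works, center, half_width=2):
    """Aggregate frequency of the word group in a window of years,
    weighting each work by its length in words."""
    window = [w for w in works if abs(w[0] - center) <= half_width]
    total_words = sum(w[1] for w in window)
    total_hits = sum(w[2] for w in window)
    return total_hits / total_words

# Each tract's own frequency (60/8000 = 0.0075) is well above the
# aggregate, yet the two tracts barely move the windowed average.
print(round(moving_average(works, 1790), 5))  # 0.00453
```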

The shape of the curve here suggests that we’re looking at a discourse that increased steadily in prominence through the eighteenth century and peaked (in fiction) around the year 1800, before sinking back to a level that was still roughly twice its early-eighteenth-century frequency.

Why might this have happened? It’s always a good idea to start by testing the most boring hypothesis — so a first guess might be that words like “moment” and “instantly” were merely displacing some set of close synonyms. But in fact most of the imaginable synonyms for this set of words correlate very closely with them. (This is true, for instance, of words like “sudden,” “abruptly,” and “alarm.”)
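That displacement test is easy to sketch in code. If “moment” were merely replacing synonyms, the frequency series should be anticorrelated; instead, the synonyms rise and fall together. The per-decade series below are invented for illustration (the real check used the actual frequency curves):

```python
# Checking the "boring hypothesis": are "moment" and its synonyms
# anticorrelated (displacement) or positively correlated (a shared
# trend)? Frequency series below are invented per-decade values.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical frequencies per 10,000 words, one value per decade
moment = [3.0, 4.1, 5.2, 7.8, 9.5, 8.1, 6.0]
sudden = [1.0, 1.4, 1.9, 2.6, 3.2, 2.7, 2.1]

r = pearson(moment, sudden)
print(r > 0.9)  # True: the series track each other closely
```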

Another way to understand what’s going on would be to look at the works where this discourse was most prominent. We might start by focusing on the peak between 1780 and 1820. In this period, the works of fiction where language of temporal immediacy is most prominent include

    Charlotte Dacre, Zofloya (1806)
    Charlotte Lennox, Hermione, or the Orphan Sisters (1791)
    M. G. Lewis, The Monk (1796)
    Ann Radcliffe, A Sicilian Romance (1790), The Castles of Athlin and Dunbayne (1789), and The Romance of the Forest (1792)
    Frances Burney, Cecilia (1782) and The Wanderer (1814)
    Amelia Opie, Adeline Mowbray (1805)
    Sophia Lee, The Recess (1785)

There is a strong emphasis here on the Gothic, but perhaps also, more generally, on women writers. The works of fiction where the same discourse is least prominent would include

    Hannah More, most of her “Cheap Repository Tracts” and Coelebs in Search of a Wife (1809)
    Robert Bage, Hermsprong (1796)
    John Trusler, Life; or the Adventures of William Ramble (1793)
    Maria Edgeworth, Castle Rackrent (1800)
    Arnaud Berquin, The Children’s Friend (1788)
    Isaac Disraeli, Vaurien; or, Sketches of the Times (1797)

Many of these works are deliberately old-fashioned in their approach to narrative form: they are moral parables, or stories for children, or first-person retrospective narratives (like Rackrent), or are told by Fieldingesque narrators who feel free to comment and summarize extensively (as in the works by Disraeli and Trusler).

After looking closely at the way the language of temporal immediacy is used in Frances Burney, Cecilia (1782), and Sophia Lee, The Recess (1785), it seems to me that it had both a formal and an affective purpose.

Formally, it foregrounded a newly sharp kind of temporal framing. If we believe Ian Watt, part of the point of the novel form is to emulate the immediacy of first-hand experience — a purpose that can be slightly at odds with the retrospective character of narrative. Eighteenth-century novelists fought the distancing effect of retrospection in a lot of ways: epistolary narrative, discovered journals and so on are ways of bringing the narrative voice as close as possible to the moment of experience. But those tricks have limits; at some point, if your heroine keeps running off to write breathless letters between every incident, Henry Fielding is going to parody you.

By the late eighteenth century it seems to me novelists were starting to work out ways of combining temporal immediacy with ordinary retrospective narration. Maybe you can’t literally have your narrator describe events as they’re taking place, but you can describe events in a way that highlights their temporal immediacy. This is one of the things that makes Frances Burney read more like a nineteenth-century novelist than like Defoe; she creates a tight temporal frame for each event, and keeps reminding her readers about the tightness of the frame. So, a new paragraph will begin “A few moments after he was gone …” or “At that moment Sir Robert himself burst into the Room …” or “Cecilia protested she would go instantly to Mr Briggs,” to choose a few examples from a single chapter of Cecilia (my italics, 363-71). We might describe this vaguely as a way of heightening suspense — but there are of course many different ways to produce suspense in fiction. Narratology comes closer to the question at issue when it talks about “pacing,” but unless someone has already coined a better term, I think I would prefer to describe this late-18c innovation as a kind of “temporal framing,” because the point is not just that Burney uses “scene” rather than “summary” to make discourse time approximate story time — but that she explicitly divides each “scene” into a succession of discrete moments.

There is a lot more that could be said about this aspect of narrative form. For one thing, in the Romantic era it seems related to a particular way of thinking about emotion — a strategy that heightens emotional intensity by describing experience as if it were divided into a series of instantaneous impressions. E.g., “In the cruelest anxiety and trepidation, Cecilia then counted every moment till Delvile came …” (Cecilia, 613). Characters in Gothic fiction are “every moment expecting” some start, shock, or astonishment. “The impression of the moment” is a favorite phrase for both Burney and Sophia Lee. On one page of The Recess, a character “resign[s] himself to the impression of the moment,” although he is surrounded by a “scene, which every following moment threatened to make fatal” (188, my italics).

In short, fairly simple tools for mapping associations between words can turn up clues that point to significant formal, as well as thematic, patterns. Maybe I’m wrong about the historical significance of those patterns, but I’m pretty sure they’re worth arguing about in any case, and I would never have stumbled on them without text mining.

On the other hand, when I develop these clues into a published article, the final argument is likely to be based largely on narratology and on close readings of individual texts, supplemented perhaps by a few simple graphs of the kind I’ve provided above. I suppose I could master cutting-edge natural language processing, in order to build a fabulous tool that would actually measure narrative pace, and the division of scenes into incidents. That would be fun, because I love coding, and it would be impressive, since it would prove that digital techniques can produce literary evidence. But the thing is, I already have an open-source application that can measure those aspects of narrative form, and it runs on inexpensive hardware that requires only water, glucose, and caffeine.

The methodological point I want to make here is that relatively simple forms of text mining, based on word counts, may turn out to be the techniques that are in practice most useful for literary critics. Moreover, if I can speak frankly: what makes this fact hard for us to acknowledge is not technophilia per se, but the nature of the social division between digital humanists and mainstream humanists. Literary critics who want to dismiss text mining are fond of saying “when you get right down to it, it’s just counting words.” (At moments like this we seem to forget everything 20c literary theorists ever learned from linguistics, and go back to treating language as a medium that, ideally, ought to be immaterial and transparent. Surely a crudely verbal approach — founded on lumpy, ambiguous words — can never tell us anything about the conceptual subtleties of form and theme!) Stung by that critique, digital humanists often feel we have to prove that our tools can directly characterize familiar literary categories, by doing complex analyses of syntax, form, and sentiment.

I don’t want to rule out those approaches; I’m not interested in playing the game “Computers can never do X.” They probably can do X. But we’re already carrying around blobs of wetware that are pretty good at understanding syntax and sentiment. Wetware is, on the other hand, terrible at counting several hundred thousand words in order to detect statistical clues. And clues matter. So I really want to urge humanists of all stripes to stop imagining that text mining has to prove its worth by proving literary theses.

That should not be our goal. Full-text search engines don’t perform literary analysis at all. Nor do they prove anything. But literary scholars find them indispensable: in fact, I would argue that search engines are at least partly responsible for the historicist turn in recent decades. If we take the same technology used in those engines (a term-document matrix plus vector space math), and just turn the matrix on its side so that it measures the strength of association between terms rather than documents, we will have a new tool that is equally valuable for literary historians. It won’t prove any thesis by itself, but it can uncover a whole new range of literary questions — and that, it seems to me, ought to be the goal of text mining.
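The “turn the matrix on its side” move can be sketched in a few lines. A search engine compares the document columns of a term-document matrix; applying the same vector-space math to the term rows measures association between words instead. The matrix here is a toy (real ones would be large and sparse):

```python
# Turning a term-document matrix "on its side": cosine similarity
# between term rows measures association between words, using exactly
# the vector-space math a search engine applies to document columns.
import math

# Toy term-document matrix: rows = terms, columns = documents
terms = ["felt", "moment", "whale"]
matrix = [
    [4, 3, 0, 1],   # "felt"
    [5, 2, 0, 1],   # "moment"
    [0, 0, 6, 0],   # "whale"
]

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# "felt" and "moment" co-occur across documents; "whale" does not
felt_moment = cosine(matrix[0], matrix[1])
felt_whale = cosine(matrix[0], matrix[2])
print(felt_moment > felt_whale)  # True
```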

Frances Burney, Cecilia; or, Memoirs of an Heiress, ed. Peter Sabor and Margaret Ann Doody (Oxford: OUP, 1999).
Sophia Lee, The Recess; or, a Tale of Other Times, ed. April Alliston (Lexington: UP of Kentucky, 2000).

[Postscript: This post, originally titled “How to make text mining serve literary history,” is a version of a talk I gave at NASSR 2011 in Park City, Utah, sharpened by the discussion that took place afterward. I’d like to thank the organizers of the conference (Andrew Franta and Nicholas Mason) as well as my co-panelists (Mark Algee-Hewitt and Mark Schoenfield) and everyone in the audience. The original slides are here; sometimes PowerPoint can be clearer than prose.

I don’t mean to deny, by the way, that the simple tools I’m using could be refined in many ways — e.g., they could include collocations. What I’m saying is that I don’t think we need to wait for technical refinements. Our text-mining tools are already sophisticated enough to produce valuable leads, and even after we make them more sophisticated, it will remain true that at some point in the critical process we have to get off the bicycle and walk.]