Genre, gender, and point of view.

A paper for the IEEE “big humanities” workshop, written in collaboration with Michael L. Black, Loretta Auvil, and Boris Capitanu, is available on arXiv now as a preprint.

The Institute of Electrical and Electronics Engineers is an odd venue for literary history, and our paper ends up touching so many disciplinary bases that it may be distracting.* So I thought I’d pull out four issues of interest to humanists and discuss them briefly here; I’m also taking the occasion to add a little information about gender that we uncovered too late to include in the paper itself.

1) The overall point about genre. Our title, “Mapping Mutable Genres in Structurally Complex Volumes,” may sound like the sort of impossible task heroines are assigned in fairy tales. But the paper argues that the blurry mutability of genres is actually a strong argument for a digital approach to their history. If we could start from some consensus list of categories, it would be easy to crowdsource the history of genre: we’d each take a list of definitions and fan out through the archive. But centuries of debate haven’t yet produced stable definitions of genre. In that context, the advantage of algorithmic mapping is that it can be comprehensive and provisional at the same time. If you change your mind about underlying categories, you can just choose a different set of training examples and hit “run” again. In fact we may never need to reach a consensus about definitions in order to have an interesting conversation about the macroscopic history of genre.

2) A workset of 32,209 volumes of English-language fiction. On the other hand, certain broad categories aren’t going to be terribly controversial. We can probably agree about volumes — and eventually specific page ranges — that contain (for instance) prose fiction and nonfiction, narrative and lyric poetry, and drama in verse, or prose, or some mixture of the two. (Not to mention interesting genres like “publishers’ ads at the back of the volume.”) As a first pass at this problem, we extract a workset of 32,209 volumes containing prose fiction from a collection of 469,200 eighteenth- and nineteenth-century volumes in HathiTrust Digital Library. The metadata for this workset is publicly available from Illinois’ institutional repository. More substantial page-level worksets will soon be produced and archived at HathiTrust Research Center.

3) The declining prevalence of first-person narration. Once we’ve identified this fiction workset, we switch gears to consider point of view — frankly, because it’s a temptingly easy problem with clear literary significance. Though the fiction workset we’re using is defined more narrowly than it was last February, we confirm the result I glimpsed at that point, which is that the prevalence of first-person point of view declines significantly toward the end of the eighteenth century and then remains largely stable for the nineteenth.

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 32,209 volumes of fiction extracted from HathiTrust Digital Library. Points are mean probabilities for five-year spans of time; a trend line with standard errors has been plotted with loess smoothing.



We can also confirm that result in a way I’m finding increasingly useful, which is to test it in a collection of a completely different sort. The HathiTrust collection includes reprints, which means that popular works have more weight in the collection than a novel printed only once. It also means that many volumes carry a date much later than their first date of publication. In some ways this gives a more accurate picture of print culture (an approximation to “what everyone read,” to borrow Scott Weingart’s phrase), but one could also argue for a different kind of representativeness, where each volume would be included only once, in a record dated to its first publication (an attempt to represent “what everyone wrote”).

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 774 volumes of fiction selected by multiple hands from multiple sources. Plotted in 20-year bins because n is small here. Works are weighted by the number of words they contain.



Fortunately, Jordan Sellers and I produced a collection like that a few years ago, and we can run the same point-of-view classifier on this very different set of 774 fiction volumes (metadata available), selected by multiple hands from multiple sources (including TCP-ECCO, the Brown Women Writers Project, and the Internet Archive). Doing that reveals broadly the same trend line we saw in the HathiTrust collection. No collection can be absolutely representative (for one thing, because we don’t agree on what we ought to be representing). But discovering parallel results in collections that were constructed very differently does give me some confidence that we’re looking at a real trend.
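For readers who want to see what "weighted by the number of words" in 20-year bins amounts to, here is a minimal sketch. It is not the code behind the figure: the function name `binned_weighted_mean`, the record layout, and the four sample volumes are all made up for illustration, and the real probabilities come from the classifier described in the paper.

```python
from collections import defaultdict


def binned_weighted_mean(records, bin_width=20, start_year=1700):
    """Average first-person probability per time bin, weighted by word count.

    `records` is an iterable of (year, probability, word_count) tuples.
    This layout is an assumption made for the sketch.
    """
    weighted_sums = defaultdict(float)
    total_words = defaultdict(float)
    for year, probability, word_count in records:
        # Map the year to the first year of its bin (1700, 1720, 1740, ...).
        bin_start = start_year + ((year - start_year) // bin_width) * bin_width
        weighted_sums[bin_start] += probability * word_count
        total_words[bin_start] += word_count
    return {
        bin_start: weighted_sums[bin_start] / total_words[bin_start]
        for bin_start in sorted(weighted_sums)
        if total_words[bin_start] > 0
    }


# Invented volumes: (year, classifier probability of first person, words).
volumes = [
    (1719, 0.9, 100_000),
    (1722, 0.8, 50_000),
    (1813, 0.1, 120_000),
    (1818, 0.7, 80_000),
]

trend = binned_weighted_mean(volumes)
for bin_start, mean_probability in trend.items():
    print(bin_start, round(mean_probability, 3))
```

Weighting by words means a long novel pulls its bin's average harder than a short tale does, which is why the two 1810s volumes above average to something closer to the longer one's probability.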

4) Gender and point of view. In the process of classifying works of fiction, we stumbled on interesting thematic patterns associated with point of view. Features associated with first-person perspective include first-person pronouns, obviously, but also number words and words associated with sea travel. Some of this association may be explained by the surprising persistence of a particular two-century-long genre, the Robinsonade. A castaway premise obviously encourages first-person narration, but the colonial impulse in the Robinsonade also seems to have encouraged acquisitive enumeration of the objects (goats, barrels, guns, slaves) its European narrators find on ostensibly deserted islands. Thus all the number words. (But this association of first-person perspective with colonial settings and acquisitive enumeration may well extend beyond the boundaries of the Robinsonade to other genres of adventure fiction.)

Third-person perspective, on the other hand, is durably associated with words for domestic relationships (husband, lover, marriage). We’re still trying to understand these associations; they could be consequences of a preference for third-person perspective in, say, courtship fiction. But third-person pronouns correlate particularly strongly with words for feminine roles (girl, daughter, woman) — which suggests that there might also be a more specifically gendered dimension to this question.

Since transmitting our paper to the IEEE I’ve had a chance to investigate this hypothesis in the smaller of the two collections we used for that paper — 774 works of fiction between 1700 and 1899: 521 by men, 249 by women, and four not characterized by gender. (Mike Black and Jordan Sellers recorded this gender data by hand.) In this collection, it does appear that male writers choose first-person perspective significantly more than women do. The gender gap persists across the whole timespan, although it might be fading toward the end of the nineteenth century.

Proportion of works of fiction by men and women in first person. Based on the same set of 774 volumes described above. (This figure counts strictly by the number of works rather than weighting works by the number of words they contain.)



Over the whole timespan, women use first person in roughly 23% of their works, and men use it in roughly 35% of their works.** That’s not a huge difference, but in relative terms it’s substantial. (Men are using first person about 52% more often than women.) The Bayesian mafia have made me wary of p-values, but if you still care: a chi-squared test on the 2×2 contingency table of gender and point of view gives p < 0.001. (Attentive readers may already be wondering whether the decline of first person might be partly explained by an increase in the proportion of women writers. But actually, in this collection, works by women have a distribution that skews slightly earlier than that of works by men.)
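If you want to check the arithmetic yourself, the sketch below reruns it in plain Python. A loud caveat: the post gives only rounded percentages, so the cell counts here are back-calculated from 23%, 35%, and the group sizes (521 and 249). They approximate the real contingency table but are not the real one. The function `chi_squared_2x2` is my own helper, and it computes a Pearson chi-squared without a continuity correction, which is an assumption about how the test was run.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table.

    Returns (statistic, p_value). With one degree of freedom the p-value
    has the closed form erfc(sqrt(statistic / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# ASSUMPTION: counts reconstructed from rounded percentages, not real data.
men_total, women_total = 521, 249
men_first = round(0.35 * men_total)
women_first = round(0.23 * women_total)

# Rows: men, women. Columns: first person, third person.
table = [
    [men_first, men_total - men_first],
    [women_first, women_total - women_first],
]

statistic, p_value = chi_squared_2x2(table)
relative_gap = 0.35 / 0.23 - 1  # how much more often men use first person

print(f"chi-squared = {statistic:.2f}, p = {p_value:.5f}")
print(f"relative gap = {relative_gap:.0%}")
```

With these reconstructed counts the p-value lands below 0.001, in line with the figure quoted above, though the exact statistic will differ from one computed on the real table.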

These are very preliminary results. 774 volumes is a small set when you could test 32,209. At the recent HTRC Uncamp, Stacy Kowalczyk described a method for gender identification in the larger HathiTrust corpus, which we will be eager to borrow once it’s published. Also, the mere presence of an association between gender and point of view doesn’t answer any of the questions literary critics will really want to pose about this phenomenon: why is point of view associated with gender? Is this actually a direct consequence of gender, or is it an indirect consequence of some other variable like genre? Does this gendering of narrative perspective really fade toward the end of the nineteenth century? I don’t pretend to have answered any of those questions; all I’m doing here is flagging the existence of an interesting open question that deserves further inquiry.

— — — — —

*Other papers for the panel are beginning to appear online. Here’s “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,” by David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon.

** We don’t actually represent point of view as a binary choice between first person and third person; the classifier reports probabilities as a continuous range between 0 and 1. But for purposes of this blog post I’ve simplified by dividing the works into two sets at the 0.5 mark. On this point, and for many other details of quantitative methodology, you’ll want to consult the paper itself.
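The simplification in this footnote can be sketched in a few lines. Everything here is illustrative: the function name `first_person_share`, the (group, probability) layout, and the sample values are invented, and treating a probability of exactly 0.5 as first person is my assumption, since the footnote doesn't say which side the boundary falls on.

```python
def first_person_share(works, threshold=0.5):
    """Share of works per group whose probability meets the threshold.

    `works` is an iterable of (group, probability) pairs. A probability
    equal to the threshold counts as first person (an assumption).
    """
    counts = {}
    for group, probability in works:
        total, first_person = counts.get(group, (0, 0))
        is_first = 1 if probability >= threshold else 0
        counts[group] = (total + 1, first_person + is_first)
    return {
        group: first_person / total
        for group, (total, first_person) in counts.items()
    }


# Invented classifier outputs, not drawn from the 774-volume collection.
sample = [
    ("men", 0.9), ("men", 0.2), ("men", 0.6),
    ("women", 0.1), ("women", 0.8), ("women", 0.3), ("women", 0.4),
]

shares = first_person_share(sample)
print(shares)
```

Counting this way treats every work equally, which matches the gender figure (counted by works) rather than the word-weighted trend lines.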

Hold on loosely; or, Gemeinschaft and Gesellschaft on the web.

I want to try a quick experiment.

The digital humanities community must …

If that sounds like a plausible beginning to a sentence, what about this one?

The literary studies community must …

Does that sound as odd to you as it does to me? No one pretends literary studies is a community. In the U.S., the discipline becomes visible to itself mainly at the spectacular, but famously alienating, yearly ritual of the MLA. A hotel that contains disputatious full professors and brilliant underemployed jobseekers may be many interesting things, but “community” is not the first word that comes to mind.

“Digital humanities,” on the other hand, frequently invokes itself as a “community.” The reasons may stretch back into the 90s, and to the early beleaguered history of humanities computing. But the contemporary logic of the term is probably captured by Matt Kirschenbaum, who stresses that the intellectually disparate projects now characterized as DH are unified above all by reliance on social media, especially Twitter.

In many ways that’s a wonderful thing. Twitter is not a perfectly open form, and it’s certainly not an egalitarian one; it has a one-to-many logic. But you don’t have to be a digital utopian to recognize that academic fields benefit from frequent informal contact among their members — what Dan Cohen has described as “the sidewalk life of successful communities.” Twitter is especially useful for establishing networks that cross disciplinary (and professional) boundaries; I’ve learned an amazing amount from those networks.

On the other hand, the illusion of open and infinitely extensible community created by Twitter has some downsides. Ferdinand Tönnies’s distinction between Gemeinschaft and Gesellschaft may not describe all times and places well, but I find it useful here as a set of ideal types. A Gemeinschaft (community) is bound together by personal contact among members and by shared implicit values. It may lack formal institutions, so its members have to be restrained by moral suasion and peer pressure. A Gesellschaft (society) doesn’t expect all its members to share the same values; it expects them to be guided mostly by individual aims, restrained and organized by formal institutions.

Given that choice, wouldn’t everyone prefer to live in cozy Gemeinschaft? Well, sure, except … remember you’re going to have to agree on a set of values! Digital humanists have spent a lot of time discussing values (Lisa Spiro, “Why We Fight”), but as the group gets larger that discussion may prove quite difficult. In the humanities, disagreeing about values is part of our job. It may be just one part of the job in humanities computing, which has a collaborative emphasis. But disagreeing about values has been almost the whole job in more traditional precincts of the humanities. As DH expands, that difference creates yet another layer of disagreement — a meta-struggle over meta-values labeled “hack” and “yack.”

But you know that. Why am I saying all this? I hope the frame I’m offering here is a useful way to understand the growing pains of a web-mediated academic project. DH has at times done a pretty good imitation of Gemeinschaft, but as it gets bigger it’s necessarily going to become more Gesellschaft-y. Which may sound sadder than it is; here’s where I invoke the title of this post. Academic community doesn’t have to be impersonal, but in the immortal words of .38 Special, we need to give each other “a whole lot of space to breathe in.”

This may involve consciously bracketing several values that we celebrate in other contexts. For instance, the centrifugal logic of a growing field isn’t a problem that can be solved by “niceness.” Resolving academic debates by moral suasion on Twitter is not just a bad idea because it produces flame wars. It would be an even worse idea if it worked — because we don’t really want an academic project to have that kind of consensus, enforced by personal ties and displays of collective solidarity.

On the other hand, the values of “candor” and “open debate” may be equally problematic on the web. Filter bubbles have their uses. I want to engage all points of view, but I can’t engage them all at one-hour intervals.

An open question that I can’t answer concerns the role of Twitter here. I’ve found it enormously valuable, both as a latecomer to “DH,” and as an interested lurker in several other fields (machine learning, linguistics, computational social science). I also find it personally enjoyable. But it’s possible that Twitter will just structurally tempt humanists into attempting a more cohesive, coercive kind of Gemeinschaft than academic social networks can (or should) sustain. It’s also possible that we’ll see a kind of cyclic logic here, where Twitter remains valuable for newcomers but tends to become a drain on the time and energy of scholars who already have extensive networks in a field. I don’t know.

Postscript a few hours later: The best reflection on the “cyclic logic” of academic projects online is still Bethany Nowviskie’s “Eternal September of the Digital Humanities,” which remains strikingly timely even after the passage of (gasp) three years.