Do topic models warp time?

Recently, historians have been trying to understand cultural change by measuring the “distances” that separate texts, songs, or other cultural artifacts. Where distances are large, they infer that change has been rapid. There are many ways to define distance, but one common strategy begins by topic modeling the evidence. Each novel (or song, or political speech) can be represented as a distribution across topics in the model. Then researchers estimate the pace of change by measuring distances between topic distributions.

In 2015, Mauch et al. used this strategy to measure the pace of change in popular music—arguing, for instance, that changes linked to hip-hop were more dramatic than the British invasion. Last year, Barron et al. used a similar strategy to measure the influence of speakers in French Revolutionary debate.

I don’t think topic modeling causes problems in either of the papers I just mentioned. But these methods are so useful that they’re likely to be widely imitated, and I do want to warn interested people about a couple of pitfalls I’ve encountered along the road.

One reason for skepticism will immediately occur to humanists: are human perceptions about difference even roughly proportional to the “distances” between topic distributions? In one case study I examined, the answer turned out to be “yes,” but there are caveats attached. Read the paper if you’re curious.

In this blog post, I’ll explore a simpler and weirder problem. Unless we’re careful about the way we measure “distance,” topic models can warp time. Time may seem to pass more slowly toward the edges of a long topic model, and more rapidly toward its center.

For instance, suppose we want to understand the pace of change in fiction between 1885 and 1984. To make sure that there is exactly the same amount of evidence in each decade, we might randomly select 750 works in each decade, and reduce each work to 10,000 randomly sampled words. We topic-model this corpus. Now, suppose we measure change across every year in the timeline by calculating the average cosine distance between the two previous years and the next two years. So, for instance, we measure change across the year 1911 by taking each work published in 1909 or 1910, and comparing its topic proportions (individually) to every work published in 1912 or 1913. Then we’ll calculate the average of all those distances. The (real) results of this experiment are shown below.

firstdiscovery

Perhaps we’re excited to discover that the pace of change in fiction peaks around 1930, and declines later in the twentieth century. It fits a theory we have about modernism! Wanting to discover whether the decline continues all the way to the present, we add 25 years more evidence, and create a new topic model covering the century from 1910 to 2009. Then we measure change, once again, by measuring distances between topic distributions. Now we can plot the pace of change measured in two different models. Where they overlap, the two models are covering exactly the same works of fiction. The only difference is that one covers a century (1885-1984) centered at 1935, and the other a century (1910-2009) centered at 1960.

seconddiscovery

But the two models provide significantly different pictures of the period where they overlap. 1978, which was a period of relatively slow change in the first model, is now a peak of rapid change. On the other hand, 1920, which was a point of relatively rapid change, is now a trough of sluggishness.

Puzzled by this sort of evidence, I discussed this problem with Laure Thompson and David Mimno at Cornell, who suggested that I should run a whole series of models using a moving window on the same underlying evidence. So I slid a 100-year window across the two centuries from 1810 to 2009 in five 25-year steps. The results are shown below; I’ve smoothed the curves a little to make the pattern easier to perceive.

timewarp

The models don’t agree with each other well at all. You may also notice that all these curves are loosely n-shaped; they peak at the middle and decline toward the edges (although sometimes to an uneven extent). That’s why 1920 showed rapid change in a model centered at 1935, but became a trough of sloth in one centered at 1960. To make the pattern clearer we can directly superimpose all five models and plot them on an x-axis using date relative to the model’s timeline (instead of absolute date).

rainbow

The pattern is clear: if you measure the pace of change by comparing documents individually, time is going to seem to move faster near the center of the model. I don’t entirely understand why this happens, but I suspect the problem is that topic diversity tends to be higher toward the center of a long timeline. When the modeling process is dividing topics, phenomena at the edges of the timeline may fall just below the threshold to form a distinct topic, because they’re more sparsely represented in the corpus (just by virtue of being near an edge). So phenomena at the center will tend to be described with finer resolution, and distances between pairs of documents will tend to be greater there. (In our conversation about the problem, David Mimno ran a generative simulation that produced loosely similar behavior.)

To confirm that this is the problem, I’ve also measured the average cosine distance, and Kullback-Leibler divergence, between pairs of documents in the same year. You get the same n-shaped pattern seen above. In other words, the problem has nothing to do with rates of change as such; it’s just that all distances tend to be larger toward the center of a topic model than at its edges. The pattern is less clearly n-shaped with KL divergence than with cosine distance, but I’ve seen some evidence that it distorts KL divergence as well.

But don’t panic. First, I doubt this is a problem with topic models that cover less than a decade or two. On a sufficiently short timeline, there may be no systematic difference between topics represented at the center and at the edges. Also, this pitfall is easy to avoid if we’re cautious about the way we measure distance. For instance, in the example above I measured cosine distance between individual pairs of documents across a 5-year period, and then averaged all the distances to create an “average pace of change.” Mathematically, that way of averaging things is slighly sketchy, for reasons Xanda Schofield explained on Twitter:

xanda

The mathematics of cosine distance tend to work better if you average the documents first, and then measure the cosine between the averages (or “centroids”). If you take that approach—producing yearly centroids and comparing the centroids—the five overlapping models actually agree with each other very well.

timeunwarped

Calculating centroids factors out the n-shaped pattern governing average distances between individual books, and focuses on the (smaller) component of distance that is actually year-to-year change. Lines produced this way agree very closely, even about individual years where change seems to accelerate. As substantive literary history, I would take this evidence with a grain of salt: the corpus I’m using is small enough that the apparent peaks could well be produced by accidents of sampling. But the math itself is working.

I’m slightly more confident about the overall decline in the pace of change from the nineteenth century to the twenty-first. Although it doesn’t look huge on this graph, that pattern is statistically quite strong. But I would want to look harder before venturing a literary interpretation. For instance, is this pattern specific to fiction, or does it reflect a broadly shared deceleration in underlying rates of linguistic change? As I argued in a recent paper, supervised models may be better than raw distance measures at answering that culturally-specific question.

But I’m wandering from the topic of this post. The key observation I wanted to share is just that topic models produce a kind of curved space when applied to long timelines; if you’re measuring distances between individual topic distributions, it may not be safe to assume that your yardstick means the same thing at every point in time. This is not a reason for despair: there are lots of good ways to address the distortion. But it’s the kind of thing researchers will want to be aware of.

 

The Gender Balance of Fiction, 1800-2007

by Ted Underwood and David Bamman

Last year, we wrote a blog post that posed questions about the differentiation of gendered roles in fiction. In doing that, we skipped over a more obvious question: how equally (or unequally) do stories distribute their attention between men and women?

This year, we’re returning to that simple question, with a richer dataset (supported by ongoing work at HathiTrust Research Center). The full story will come out in an article, but we’d like to share a few big-picture points in advance.

To start with, why have we framed this as a question about “women” and “men”? Gender isn’t a binary phenomenon. But we aren’t inquiring about the truth of gender identity here — just about gross inequalities that have separated conventional public roles. English-language fiction does typically divide characters by calling them “he” or “she,” and that division is a good place to start posing questions.

We could measure underrepresentation by counting people, but then we’d have to decide how much weight to give minor characters. A simpler approach is just to ask how many words are used to describe fictional men or women, respectively. BookNLP gave us a way to answer that question; it uses names and honorifics to infer a character’s gender, and then traces grammatical dependencies to identify adjectives that modify a character, nouns she possesses, or verbs she governs. After swinging BookNLP through 93,708 English-language volumes identified as fiction from the HathiTrust Digital Library, we can estimate the percentage of words used in characterization that are used to describe women. (To simplify the task of reading this illustration, we have left out characters coded as “other” or unknown,” so a year with equal representation of men and women would be located on the 50% line.).  To help quantify our uncertainty, we present each measurement by year along with a 95% confidence interval calculated using the bootstrap; our uncertainty decreases over time, largely as a function of an increasing number of books being published.

fig1

There is a clear decline from the nineteenth century (when women generally take up 40% or more of the “character space” in fiction) to the 1950s and 60s, when their prominence hovers around a low of 30%. A correction, beginning in the 1970s, almost restores fiction to its nineteenth-century state. (One way of thinking about this: second-wave feminism was a desperately-needed rescue operation.)

The fluctuation is not enormous, but also not trivial: women lose roughly a fourth of the space on the page they had possessed in the nineteenth century. Nor is this something we already knew. It might be a mistake to call this pattern a “surprise”: it’s not as if everyone had clearly-formed expectations about “space on the page.” But when we do pose the question, and ask scholars what they expect to see before revealing this evidence, several people have predicted a series of advances toward equality that correspond to e.g. the suffrage movement and World War II, separated by partial retreats. Instead we see a fairly steady decline from 1860 to 1970, with no overall advance toward equality.

What’s the explanation? Our methods do have blind spots. For instance, we aren’t usually able to infer gender for first-person protagonists, so they are left out here. And our inferences about other characters have a known level of error. But after cross-checking the evidence, we don’t believe the level of error is large enough to explain away this pattern (see our github repo for fuller discussion). It is of course possible that our sample of fiction is skewed. For instance, a sample of 93,708 volumes will include a lot of obscure works and works in translation. What if we focus on slightly more prominent works? We have posed that question by comparing our Hathi sample to a smaller (10,000-volume) sample drawn from the Chicago Text Lab, which emphasizes relatively prominent American works, and filters out works in translation.

fig2_chicago

As you can see, the broad outlines of the trend don’t change. If anything, the decline from 1860 to 1970 is slightly more marked in the Chicago corpus (perhaps because it does a better job of filtering out reprints, which tend to muffle change). This doesn’t prove that we will see the same pattern in every sample. There are many ways to sample the history of fiction! Some scholars will want to know about paperbacks that tend to be underrepresented in university libraries; others will only be interested in a short list of hypercanonical authors. We can’t exhaust all possible modes of sampling, but we can say at least that this trend is not an artefact of a single sampling strategy.  Nor is it an artefact of our choice to represent characters by counting words syntactically associated with them: we see the same pattern of decline to different degrees when measuring the amount of dialogue spoken by men and women, and in simply counting the number of characters as well.

So what does explain the declining representation of women? We don’t yet know. But the trend seems too complex to dismiss with a single explanation. For instance, it can be partly — but only partly — explained by a decline in the proportion of fiction writers who were women.

author

Take specific dots with a grain of salt; there are sources of error here, especially because the wall of copyright at 1923 may change digitization practices or throw off our own data pipeline. (Note the outlier right at 1923.) But the general pattern above is echoed also in the Chicago sample of American fiction, so we feel confident that there was really a decline in the fraction of fiction writers who were women. As far as we know, Chris Forster was the first person to gather broad quantitative evidence of this decline. But many scholars have grasped pieces of the story: for instance, Anne E. Boyd takes The Atlantic around 1890 as a case study of a process whereby the professionalization and canonization of American fiction tended to push out women who had previously been prominent. [See also Tuchman and Fortin 1989 in references below.]

But this is not necessarily a story about the marginalization of women writers in general. (On the contrary, the prominence of women rose throughout this period in several nonfiction genres.) The decline was specific to fiction — either because the intellectual opportunities open to women were expanding beyond belles lettres, or because the rising prestige of fiction attracted a growing number of men.

Men are overrepresented in books by men, so a decline in the number of women novelists will also tend to reduce the number of characters who are women. But that doesn’t completely explain the marginalization of feminine characters from 1860 to 1970. For instance, we can also divide authors by gender, and look at shifting patterns of attention within works by women or by men.

by_author_gender

There are several interesting details here. The inequality of attention in books by men is depressingly durable (men rarely give more than 30% of their attention to fictional women). But it’s also interesting that the fluctuations we saw earlier remain visible even when works are divided by author gender: both trend lines above show a slight decline in the space allotted to women, from 1860 to 1970. In other words, it’s not just that there were fewer works of fiction written by women; even inside books written by women, feminine characters were occupying slightly less space on the page.

Why? The rise of genres devoted to “action” and “adventure” might play a role, although we haven’t found clear evidence yet that it makes a difference. (Genre boundaries are too blurry for the question to be answered easily.) Or fiction might have been masculinized in some broader sense, less tied to specific genre categories (see Suzanne Clark, for instance, on modernism as masculinization.)

But listing possible explanations is the easy part. Figuring out which are true — and to what extent — will be harder.

We will continue to explore these questions, in collaboration with grad students, but we also want to draw other scholars’ attention to resources that can support this kind of inquiry (and invite readers to share useful secondary sources in the comments).

HathiTrust Research Center’s Extracted Features Dataset doesn’t permit the syntactic parsing performed by BookNLP, but even authors’ names and the raw frequencies of gendered pronouns can tell you a lot. Working just with that dataset, Chris Forster was able to catch significant patterns involving gender.

When we publish our article, we will also share data produced by BookNLP about specific characters across a collection of 93,708 books. HTRC is also building a “Data Capsule” that will allow other scholars to produce similar data themselves. In the meantime, in collaboration with Nikolaus N. Parulian, we have produced an interactive visualization that allows you to explore changes in the gendering of words used in characterization. (Compare, for instance, “grin” to “smile,” or “house” to “room.”) We have also made available the metadata and yearly summaries behind the visualization.

Acknowledgments. The work described here has been supported by NovelTM, funded by the Canadian Social Sciences and Humanities Research Council, and by the WCSA+DC grant at HathiTrust Research Center, funded  by the Andrew W. Mellon Foundation. We thank Hoyt Long, Teddy Roland, and Richard Jean So for permission to use the Chicago Novel Corpus. The project often relied on GenderID.py, by Bridget Baird and Cameron Blevins (2014). Boris Capitanu helped parallelize BookNLP across hundreds of thousands of volumes. Attendees at the 2016 NovelTM meeting, and Justine Murison in Illinois, provided valuable advice about literary history.

References.

Boyd, Anne E. “‘What, Has She Got into the Atlantic?’ Women Writers, The Atlantic Monthly, and the Formation of the American Canon,” American Studies 39.3 (1998): 5-36.

Clark, Suzanne. Sentimental Modernism: Women Writers and the Revolution of the Word (Indianapolis: Indiana University Press, 1992).

Forster, Chris. “A Walk Through the Metadata: Gender in the HathiTrust Dataset.” September 8, 2015. http://cforster.com/2015/09/gender-in-hathitrust-dataset/

Tuchman, Gaye, with Nina E. Fortin. Edging Women Out: Victorian Novelists, Publishers, and Social Change. New Haven: Yale University Press, 1989.

 

 

 

A distant reading of moments

I delivered a talk about time at the English Institute yesterday. Since it could easily be a year before the print version comes out, I thought I would share the draft as a working paper.

The argument has two layers. On one level it’s about the tension between distant reading and New Historicism. The New Historical anecdote fuses history with literary representation in a vivid, influential way, by compressing a large theme into a brief episode. Can quantitative arguments about the past aspire to the same kind of compression and vividness?

Inside that metacritical frame, there’s a history of narrative pace, based on evidence I gathered in collaboration with Sabrina Lee and Jessica Mercado. (We’re also working on a separate co-authored piece that will dive more deeply into this data.)

We ask how much fictional time is narrated, on average, in 250 words. We discover some dramatic changes across a timeline of 300 years, and I’m tempted to include our results as an illustration here. But I’ve decided not to, because I want to explore whether scholars already, intuitively know how the representation of duration has changed, by asking readers to reflect for a moment on what they expect to see.

So instead of illustrating this post with real evidence, I’ve provided a plausible, counterfactual illustration based on an account of duration that one might extract from influential narratological works by Gérard Genette or Seymour Chatman.

blamemodernismuse

Artificial data, generated to simulate the account of narrative pace one might extract from Gérard Genette, Narrative Discourse. Logarithmic scale.

To find out what the real story is, you’ll have to read the paper, “Why Literary Time Is Measured in Minutes.”

(Open data and code aren’t out yet, but they will be released with our co-authored essay.)

The instability of gender

Ted Underwood and David Bamman

1500-word abstract of a paper delivered Sat, Jan 9th, at MLA 2016, in a panel with Deidre Lynch and Andrew Piper. (An article based on this research, and further research with Sabrina Lee, will appear in Cultural Analytics in early 2018.)

helpfulBy visualizing course evaluations, Ben Schmidt has reminded us how subtly (and irrationally) descriptions of real people are shaped by gendered expectations. Men are praised for being funny, and condemned for being boring. Women are praised for being helpful, and condemned for being strict.

Fictional characters are never simply imagined people; they’re also aspects of novelistic form (Lynch 1998). But gendered patterns of description do appear in fiction, and it might be interesting to know how those patterns have changed. This also happens to be a problem where natural language processing can help us, since English pronouns have grammatical gender. (The gender of “me” is a trickier problem; for the purposes of this paper, we have regretfully set first-person narrators aside.)

We used BookNLP (a pipeline developed in Bamman et al. 2014a) to identify characters and the words connected to them. We applied it to 45,000 works of fiction distributed (unevenly) over the period 1780-1989. (The works themselves were partly drawn from HathiTrust and partly located at the Chicago Text Lab.) BookNLP does make errors (Vala et al., 2015), and any analysis on this scale will miss a great deal that is implied rather than said. But readers are so interested in character that it may be worth putting up with some gaps and uncertainties in order to glimpse broad historical patterns.

We asked, first, how strongly characterization is shaped by gender, and how that pressure waxed or waned across time. For instance, if you didn’t have names or pronouns, or tautological clues like “her Ladyship” and “her girlhood,” how easy would it be to infer a character’s (grammatical) gender from the apparently-genderless verbs, nouns, and adjectives associated with her?

One way to find out is to train a model to predict gender just from those implicit clues, testing it against the ground truth established by pronouns. When we do this, a long-term trend is perceptible: the linguistic differences between male and female characters get clearer to the middle of the nineteenth century, and then slowly get blurrier, through at least the 1980s.

Boxplots for 12 regularized logistic models in each decade; each model included 750 male and 750 female characters, randomly selected with the proviso that the median character size was always 51 words, and characters with less than 15 words were excluded.

Boxplots for 12 regularized logistic models in each decade; each model included 750 male and 750 female characters, randomly selected with the proviso that the median character size was always 51 words, and characters with less than 15 words were excluded.

It’s not a huge or dramatic shift, partly because gender is never easy to infer in the first place. (Since the model could get 50% of the characters right by guessing randomly, 74% is not eagle-eyed. Of course, the median character was only associated with 51 words, which is not a lot of evidence to go on.)

There are also questions about the data that make it difficult to be confident about details. We have sparse data before 1810, so we’re not certain yet that gender was really less clearly marked in the eighteenth century — although Virginia Woolf does tell us that “the sexes drew further and further apart” as the nineteenth century began (Woolf 1992: 219).

Also, after 1923, our dataset gets a little more American and a little better at excluding reprints, so the apparent acceleration of change from 1910 to 1930 might partly reflect changes in the corpus. In the final draft, we plan to check multiple corpora against each other. But we don’t have much doubt about the broad trend from 1840 to 1989. Over that century and a half, the boundary that separates “men” and “women” in fiction does seem to get blurrier and blurrier.

What were the tacit patterns that made it possible to predict a character’s gender in the first place, and how did they change? That’s a big question; there’s room here for several decades of discussion.

But some of the broadest patterns are easy to grasp. For each word, you can measure the difference between its frequency in descriptions of women and of men. (In the graphs below, words above zero are more common in descriptions of women.) Then you can sort the words to find ones where the difference between genders is large early in the period, and declines over time.

heartmindWhen you do that, you find a lot of words that describe subjective consciousness and emotion; most of them are attributed to women. “Passion” is an exception used more often for men; of course, in the early nineteenth century, it often means “lust.”

This evidence tends to support Nancy Armstrong’s contention in Desire and Domestic Fiction that subjectivity was to begin with “a female domain” in the novel (Armstrong 4), although it puts the peak of this phenomenon a little later than she suggests.

But in general, the gendering of subjectivity is a pattern that will be familiar to scholars of the novel. So, probably, is the tension between public and private space revealed here. Throughout the nineteenth century, it’s “her chamber” and “her room,” but “his country.” Around 1925, houses switch owners.

roominhouse

The convergence of all these lines on the right side of the graph helps explain why our models find gender harder and harder to predict: many of the words you might use to predict it are becoming less common (or becoming more evenly balanced between men and women — the graphs we’ve presented here don’t yet distinguish those two sorts of change.) On balance, that’s the prevailing trend. But there are also a few implicitly gendered forms of description that do increase. In particular, physical description becomes more important in fiction (Heuser and Le-Khac 2012).

From the Famous Artists' School course materials. "The male head is square and angular, with a strong jaw."

From the Famous Artists’ School course materials. “The male head is square and angular, with a strong jaw.”

And as writers spend more time describing their characters physically, some aspects of the body and dress also become more important as signifiers of gender. This isn’t a simple, monolithic process. There are parts of the body whose significance seems to peak at a certain date and then level off — like the masculine jaw, maybe peaking around 1950?

jawchest

Other signifiers of masculinity — like the chest, and incidentally pockets — continue to become more and more important. For women, the “eyes” and “face” peak very markedly around 1890. But hair has rarely been more gendered (or bigger) than it was in the 1980s.

eyeshairface

The measures we’re using here are simple, and deliberately conflate sheer frequency with gendered-ness in order to highlight words that have both attributes. We may use a wider range of interpretive strategies in the final article. But it’s clear already that gender has been unstable, not just because the implicit gendering of characterization became blurrier overall from 1840 to 1989 — but because the specific clues associated with gender have been rather volatile. In other words, gender is not at all the same thing in 1980 that it was in 1840.

There’s nothing very novel about the discovery that gender is fluid. But of course, we like to say everything is fluid: genres, roles, geographies. The advantage of a comparative method is that it lets us say specifically what we mean. Fluid compared to what? For instance, the increasing blurriness of gender boundaries is a kind of change we don’t see when we model the boundary between detective fiction and other genres: that boundary remains remarkably stable from 1841 to 1989. So we can say the linguistic signs of gender in characterization are more mutable than at least some genres.

We didn’t have to start with a complex data model to find this fluidity. Our initial representation of gender was a naive binary one, borrowed casually from English grammar. But we still ended up discovering that the things associated with those binary reference points have been in practice very changeable.

Other approaches are possible. The model Underwood has used to define genre (in a forthcoming piece) is messy and perspectival from the get-go, patched together from different sources of testimony. A project working with appropriate kinds of evidence could, similarly, build a perspectival dimension into definitions of gender from the very outset (for inspiration see Posner 2015 and Bamman et al. 2014b). But the point of research is also to discover things that weren’t hard-coded in the original plan. Even a perspectival model of genre may end up finding that different sources actually agree, for instance, about the boundaries of detective fiction. Conversely, even naively grammatical gender categories may start to bend and blur if they’re stretched across a two-century timeline.

Acknowledgements. This project was made possible by generous support from the NovelTM project, funded by the Social Sciences and Humanities Research Council. The authors would like to acknowledge work in progress at NovelTM as an influence on their thinking, including especially a forthcoming project by Matthew L. Jockers and Gabi Kirilloff. Our models of the twentieth century depend on collections located at the Chicago Text Lab, and supported by the University of Chicago Knowledge Lab. Eleanor Courtemanche suggested the connection to Woolf. BookNLP is available on github; work planned for this year at HathiTrust Research Center will make it possible for scholars to apply it to fiction even beyond the wall of copyright.

References:

Armstrong, Nancy. 1987. Desire and Domestic Fiction: A Political History of the Novel. New York: Oxford University Press.

Bamman, David, Ted Underwood, and Noah Smith. 2014a. “A Bayesian mixed-effects model of literary character.” ACL 2014. http://www.ark.cs.cmu.edu/literaryCharacter/

Bamman, David, Jacob Eisenstein, and Tyler Schnoebelen. 2014b. Gender Identity and Lexical Variation in Social Media. Journal of Sociolinguistics 18, 2 (2014).

Heuser, Ryan, and Long Le-Khac. 2012. “A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method.” Stanford Literary Lab Pamphlet Series. http://litlab.stanford.edu/pamphlets/ May 2012.

Lynch, Deidre. The Economy of Character: Novels, Market Culture, and the Business of Inner Meaning. Chicago: The University of Chicago Press, 1998.

Posner, Miriam. 2015. “What’s Next: The Radical, Unrealized Potential of Digital Humanities.” http://miriamposner.com/blog/whats-next-the-radical-unrealized-potential-of-digital-humanities/

Schmidt, Benjamin. 2015. “Gendered language in teaching reviews.” http://benschmidt.org/profGender/

Vala, Hardik, David Jurgens, Andrew Piper, and Derek Ruths. 2015. “Mr Bennet, his Coachman, and the Archibishop Walk into a Bar, but only One of them Gets Recognized.” CEMNLP. http://cs.stanford.edu/~jurgens/docs/vala-jurgens-piper-ruths_emnlp_2015.pdf

Woolf, Virginia. 1992. Orlando: A Biography, ed. Rachel Bowlby. Oxford: Oxford University Press.

Free research question about plot.

I think the whole syuzhet controversy is turning out to be fabulously productive.

I particularly enjoyed David Bamman’s latest contribution to the discussion, which begins to flesh out what validation might look like for questions about plot. Briefly, he got five human readers to evaluate the emotional pitch of different scenes in Romeo and Juliet, and visualized the range of their agreement over time.

bamman
It’s clear that there are differences; but it’s also clear that there’s a great deal of consensus. And not surprisingly. Romeo and Juliet is (spoiler alert) a tragedy, and the simple, strong difference in perceived tone between the first and second halves of the script is exactly what we might have expected.

David offered this brief project as an example of data one could use for validating methods, which it is. But mulling this over online with Ana-Maria Popescu (whose tweets are alas protected), I realized that David’s example might also help give us a sharper sense of the literary stakes of this whole discussion. Because of course the question arises, “Will the emotional trajectory of novels be as easy to chart as that of 16/17c drama?” We intuitively suspect not, and for good reason. As Popescu put it, “work … from that period (Elizabethan) would have a more clear pattern (bc. they used plot patterns).”

She’s right. It’s a well-worn thesis about the rise of the novel that the point of novelistic realism was, partly, to get away from the predictable trajectories of comedy, tragedy, and romance — to produce a messier arc with lots of contingent interruptions (people hate it when I cite this guy, but that’s Ian Watt’s conception of formal realism). If that’s true, David’s experiment might not work as well for novels.

Matt Jockers’ syuzhet package is based on a diametrically opposed account of novelistic plot, coming through Kurt Vonnegut. Vonnegut argued that novels are really still organized by a small number of predictable patterns moving, in fairly broad undulations, between fortune and misfortune. And … wait, that sounds plausible too.

The conflict between Vonnegut and Watt might give us a testable question with clear literary stakes. Are the perceived emotional trajectories of novels in fact more complex over time, or more uncertain at any given moment, than the perceived trajectories of (say) 17c comedy and tragedy? Watt says they should be. Vonnegut says no. To be sure, there are lots of complexities involved in answering this; “emotional valence” is still not very well defined. But with a question like this, where theories of the novel clash directly, it’s hard to fail — whatever you discover, you’re going to be overturning some well-documented received opinion.

There are potentially lots of ways to approach a problem like that. David’s sort of ground truth could be used as a foundation for predictive modeling, or we could use it to validate Jockers’ method. By the way, if anyone’s still interested in doing that, here’s the trajectory you get if you run Romeo and Juliet though syuzhet using afinn sentiment detection and a low-pass setting of 5. Compare it to Bamman’s human ground truth above. One example is not validation, and this is just an eyeball comparison, but it’s a pretty decent fit. And syuzhet was incredibly easy to install and run. I did this in literally five minutes. My gut is starting to tell me that’s a nice little R package Matt just gave away for free.

syuzhetRomeo
Then again, if predictive models or sentiment detection don’t work well enough to satisfy us, there’s no reason why a question like this couldn’t be pursued purely through human annotation. I don’t have time to tackle this question; I’m working on a different project where human ground truth is provided by reviewers. But I really think someone should go for it.

Robert Boyle's description of a controversial, leaky air-pump.

Robert Boyle’s description of a controversial, notoriously leaky air-pump.


For me the lesson of this conversation has also been that the open web and dissent are still good things. I’m glad Matt Jockers put syuzhet out there as a resource, and glad Annie Swafford critiqued it. I’ve been saying this reminds me of the Hobbes-Boyle dispute; I mean partly, as Anna Marie Roos points out in a review of Leviathan and the Air-Pump, that the clash between opposing interpretations in that case fruitfully advanced knowledge.

I also mean, of course, that experiments, with clearly defined predictive hypotheses, are good things.

.

PS: By the way, if anyone’s interested, here’s Romeo and Juliet smoothed with a rolling mean (using a 101-sentence window) rather than a Fourier transform. I still understand rolling means better, and I think the detail revealed here is interesting. The balcony scene is, unsurprisingly, the high point for human readers and sentiment detection alike. As David Bamman points out, readers are a bit divided about how to interpret the tone at the end of this tragedy. Syuzhet, however, considers it a downer.

rollmeanromeo2
And Bamman’s human readers again:

bamman
P.P.S: Thanks to David Wilson-Okamura for correcting my labeling of scenes.

Why it’s hard for syuzhet to be right or wrong yet.

I’ve enjoyed following the exchange between Matt Jockers, Annie Swafford, Jacob Eisenstein, and Dan Piepenbring about Jockers’ R package syuzhet — designed to illuminate plot by tracing the “emotional valence” of narration across the course of a novel.

I’ve found this a consistently impressive and informative conversation; it has taught me literally everything I know about “low-pass filters.” But I have no idea who is right or wrong.

More fundamentally, I’m unsure how anyone could be right or wrong here, because as far as I can tell there’s no thesis under discussion yet. Jockers’ article isn’t published. All we have is an R package, syuzhet, which does something I would call exploratory data analysis. And it’s hard to evaluate exploratory data analysis in the absence of a specific argument.

For instance, does syuzhet smooth plot arcs appropriately? I don’t know. Without a specific thesis we’re trying to test, how would we decide what scale of variation matters? In some novels it might be a scene-to-scene rhythm; in others it might be a long arc. Until I know what scale of variation matters for a particular question, I have no way of knowing what kind of smoothing is “too much” or “too little.”*

The same thing goes, more fundamentally, for the concepts of “plot” and “emotional valence” themselves. As Jacob Eisenstein has pointed out, these aren’t concepts that have a single agreed-upon meaning. To argue about them meaningfully, we’re going to need a particular historical or formal question we’re trying to solve.

It seems to me likely that syuzhet will usefully illuminate some aspects of plot. But I have no way of knowing which aspects until I look at a test involving groups of books that readers perceive as different in some specific way. For instance, if syuzhet reliably discriminates between books with tragic and comic endings, that would already be interesting. It’s not everything we mean by plot, but it’s one important thing.

The underlying issue here is that Matt hasn’t published his article yet. So we don’t actually have a thesis to debate. What we have is a new form of exploratory data analysis, released as an R package. Conversation about exploration can be interesting; it can teach me a lot about low-pass filters; but I don’t know how it could be wrong or right until I know what the exploration is trying to reveal.

I think this holds even for Matt’s claim that he’s identified six (or seven) fundamental plot patterns. That sounds like a thesis, but I would tend to say it’s still description of exploratory analysis — in this case a clustering process. Matt has done the clustering in a principled and careful way, but clustering is still (in my eyes) basically an exploratory method. I’m not sure how to evaluate it until I know what kind of generic or historical evidence would count as confirmation that we’re looking at a coherent “plot pattern.”

There are a range of ways to get that confirmation. Lynn Cherny has explored plot using supervised methods; if you do that, predictive accuracy gives you an easy test. But unsupervised methods can also be great, in cases where tests aren’t so easy to define; it’s just that an unsupervised method needs to be supplemented by historical or formal discussion that tells you what would count as confirmation for this method. I imagine there will be some of that in Matt’s article, when it comes out.

* [Edit March 31: After playing around with some artificial data myself, I have to acknowledge that the low-pass filter option in syuzhet can behave in unintuitive ways where extreme outliers and edges are involved. I think Annie Swafford (in blog posts) and Daniel Lepage (below) have been right to emphasize this. It could be less of an issue with real data; I had to use pretty extreme outliers to “break” the filter; it’s not actually the case that the whole shape is necessarily defined by its single highest point. But my guess is that this sort of filter would only add value if you wanted to build in a strong prior that plot fluctuates on or near a particular “wavelength.” On the other hand, Matt Jockers has alluded to unpublished evidence for that sort of prior (or at least for a particular filter setting). So, after changing my opinion a couple times, I’m still not feeling I have an answer here.]

“Plot arcs” in the novel.

Ben Schmidt has developed a fascinating way of visualizing “plot arcs” in television series. I’ve been trying to understand how it works, with help from several people on Twitter, and also trying to see if it can reveal anything interesting about novels.

If you haven’t read Ben’s blog post, I recommend exploring it now, because I’m going to skim lightly over some of the details of his method.

cubes

At its core, the technique is not complicated. It hinges on a transformation called principal component analysis (PCA), which allows researchers to map high-dimensional data onto a two-dimensional space, while keeping individual data points as far apart as possible. You can think of PCA as a technique that gives you a “good viewing angle” for flattening out a complex object. For instance, if you’ve got eight points at the corners of a cube, you could represent them as seen in (a), but (b) might be more legible because it spreads the points out more. It does that by squashing several different physical dimensions (length and breadth) into the x axis on the page.

Ben uses this technique to reveal the structural relationship between different parts of a plot. As I understand it, he divides television scripts into six segments of equal length, and trains a topic model on all the segments. If you produce, say, 100 topics, each segment of each show is now characterized as a point in 100-dimensional space, where each dimension measures the prominence of one particular topic.

He takes the first sixth of every show and averages them to produce a single point that represents the average topic distribution for the first-sixth of all shows. After doing that for all six segments, he has six data points that represent typical segments of narrative time. Then he uses PCA to find an abstract space where those points are well separated. When he does this, he gets an arc-like structure that tends to preserve the original narrative sequence of the segments (although the algorithm isn’t directly informed about sequence). In his most detailed visualization, he even takes this down to twelfths.

Benjamin Schmidt's initial visualization of "plot arcs," December 16, 2014.

Benjamin Schmidt’s initial visualization of “plot arcs,” December 16, 2014.

But what does this mean?
From the beginning, Ben has been pretty careful to stress that he sees the parabolic shape of this pattern as an artifact of PCA. (“I should emphasize that it’s hard to imagine any other shape coming out of the PCA algorithm with the inputs I put in.”) David Bamman confirms this, showing that PCA will turn many kinds of sequential data, even random walks, into an arc. The algorithm is also good at inferring sequence: if point 1 influences point 2, and point 2 influences point 3, etc., PCA will tend to preserve their sequential relationship in the projection. (It does this even if you take 1000 different random walks and add them up to produce a composite walk.) So if we believe that the topic distribution in each segment of each story is strongly related to the topic distributions on either side, we would expect PCA to organize the composite segments of all stories in a sequential arc.

That’s sort of cool, but also suggests that the structure we’re seeing is not unique to “plots.” On the other hand, it’s worth noting that the technique does work better on fiction (and television scripts) than on nonfiction. Or, rather, it shows us something different when you apply it to nonfiction.

nonfiction

Here I’ve divided 2000 volumes of nineteenth-century nonfiction into ten parts, trained 200 topics on all 20,000 segments, and then created composite data points that represent the first “tenth,” second “tenth,” and so on, for all the volumes. PCA is still, somewhat remarkably, able to organize these points in the right sequence, but you have to squint a little to call this an arc. The graph is more clearly dominated by a contrast between introductions and body text. I’ve plotted two of the most important organizing topics as vectors; they include a lot of high-level abstractions and metadiscourse, whereas most of the topics in this nonfiction model are as specific as “birds eggs young wings” (and have a much smaller influence on this graph).

It’s important to note that I’m using the page-level metadata I recently described to select nonfiction here, which makes an effort to screen out paratext. (Otherwise we would probably be seeing topics like “table contents” and “index due date”!)

So where does this leave us? I think Lynn Cherny is right to say that with this technique, deviations from an arc are more significant than the arc itself. The slightly arc-like sequence on the right-hand side of the nonfiction graph isn’t telling us much about deep structures organizing nonfiction; it’s telling us mainly that there are continuities in text. But the “1” way over on the left-hand side is revealing a large structural fact: works of nonfiction have prefaces and introductions that can be very different from the rest of the text. Similarly, one of the most interesting aspects of Ben’s post involves the structural differences he finds toward the end between television genres (the difference between beginning and end seems more important for comedies, whereas science fiction is more organized by a contrast between central action and frame). Not a bad result for a historian to generate in his spare time.

Ten points that represent composite "tenths" of 1.981 works of fiction, topic-modeled and projected by PCA. Multivolume works have been joined.

Ten points that represent composite “tenths” of 1.981 works of fiction, topic-modeled and projected by PCA. Multivolume works have been joined.

Also, when I say differences are interesting, I don’t mean that the composite arc Ben saw by averaging all genres was meaningless. The fact that PCA will organize ten segments of 2000 novels into a parabola is not surprising. It would do that even with a random sequence. But in practice we’re not looking at random sequences, so PCA organizes points into a parabola by drawing on actual linguistic gradients that organize narrative time. As Ben has shown in a follow-up post, PCA is able to explain the patterns in television scripts better than it can explain random sequences.

In other words, the differences we’re seeing between beginnings, middles, and ends are real differences. And it’s interesting to see what those differences are. The x and y axes in a PCA projection don’t have simple meanings, because we’ve squashed multiple dimensions into two. But we can understand the space a little better by mapping the influence exerted by different topics.

Vectors that play an especially strong role in organizing the PCA projection of 1,981 nineteenth-century novels.

Vectors that play an especially strong role in organizing the PCA projection of 1,981 nineteenth-century novels.

In this visualization, for instance, topics associated with dialogue (“said am know yes”) tend to move a point up the y axis. They’re more common in the middle of a narrative.

It might also be interesting to compare the way narratives from different authors or genres project into this space.

Each author here is represented by a composite set of ten segments of narrative time, produced by averaging her works.

Each author here is represented by a composite set of ten segments of narrative time, produced by averaging her works. They are projected into a space defined by the average “tenths” of all works in the dataset.

Mary Elizabeth Braddon is a sensation novelist, and her works are strongly organized by a structure that resembles the majority of other novels in the nineteenth century (or is perhaps even more distinct than usual). A book like Lady Audley’s Secret begins with a stage-setting description of domestic space and family relationships. The middle of the book is characterized by dialogue. The tone of the diction becomes progressively more sentimental* until, in the conclusion, we back away from dialogue again to summary (but a summary that is very different from the introduction in tone).

By contrast, the novels of George Eliot are… um, perhaps it would be safest to say “not as well characterized by this model of narrative sequence.” You might be tempted to look at that tangle of lines and infer some kind of cyclic structure, but it would be a bit like reading tea leaves. I know George Eliot’s novels are interesting, but I doubt that squiggle tells me why. (It’s important to remember, for instance, that Eliot’s narrative time looks more orderly and arc-like when projected into a space defined by her own writing.)

Supervised and unsupervised models
In short, I think the method Ben has developed is interesting and worth further exploration, but I also think there are real interpretive challenges here. And the interpretive challenges are not general problems that would arise with any quantitative method: they’re specific to a quirk of this one, which is that it’s poised delicately between strategies of “supervised” and “unsupervised” modeling.

Actually, I’m not sure it’s technically accurate to call PCA a model at all; it’s almost a descriptive statistic (like the mean or standard deviation of a dataset). But the attraction of the technique is a bit like the attraction of unsupervised modeling: you turn it loose on the data and it spontaneously reveals patterns.

There’s nothing at all wrong with that, but the tricky thing here is that by focusing PCA on the temporal sequence within works, we actually give it a very strong bias toward a particular sort of pattern (a sequential arc). Which means we’re actually doing something that’s a bit more supervised than it might appear. It’s more like saying “if you assume narrative time is parabola-shaped, what would be the linguistic vectors organizing that space?”

That may not be a bad question! A lot of critics have assumed that narrative time is loosely shaped like a triangle or pyramid. So this might be a very reasonable starting assumption. But it’s important to understand that we are starting with an assumption, and there are different assumptions you could make. Matt Jockers has a different way of mapping plot — by using sentiment analysis to trace the rising or falling tone of discourse as we move through the narrative. Lynn Cherny has used supervised modeling to identify “exciting” passages in popular novels and then used that as a lever to map rhythms that move, for instance, between dialogue and exposition.

All these approaches are interesting, and potentially valid; I just think it’s important to note that none of them are giving us an unsupervised model of plot. (Even unsupervised models do make assumptions, but I would say a topic model, for instance, is slightly more open-ended than an approach that implicitly maps sequences onto arcs.) There’s nothing wrong with assuming an arc, but there might be some advantage to doing it more explicitly. If I were going to use Ben’s insight to study plot in nineteenth-century novels, I would probably drop PCA and instead train two classifiers to recognize the “ends” and “middles” of narratives. When you do that, you get a result that is actually quite parallel to the one I got by using PCA.

The average probabilities two classifiers assigned to segments from different "tenths" of 1,981 novels. Five-fold crossvalidated, but I didn't rule out the possibility that an author might appear in both the test set and the training set.

The average probabilities two classifiers assigned to segments from different “tenths” of 1,981 novels. Five-fold crossvalidated, but I didn’t rule out the possibility that an author might appear in both the test set and the training set.

But with a predictive model like a classifier, I feel a little more confident in my ability to characterize the strength of the patterns I’m seeing. In this case, for instance, the classifier that recognizes ends was about 62% accurate out of sample. The classifier that recognizes middles was about 61% accurate, and since I counted six out of ten segments of each narrative as “the middle,” that’s not a lot better than random. [Later edit: This was a hasty first pass. Some simple normalization got the classifiers up to 67% and 64%. That signal is probably strong enough for people to do more interesting things with it.]

However, I want to be clear: I don’t think there’s anything wrong with using PCA for this, as long as we realize that it’s surprisingly good at inferring sequence from random walks in high-dimensional space. If plots are “arcs” (as critics have tended to assume), why not make use of that insight to analyze and visualize them? Ben’s post shows us one way to do that. Another thing I take away from this exploration is how amazing Twitter can be, because I couldn’t have fully understood what was going on here without contributions from a lot of different people.

* Re: “the tone of the diction becomes progressively more sentimental:” Matt Wilkens points out that the vectors that characterize endings here have a lot in common with the language that Sara Steger identified as characteristic of 19c sentimental fiction.

Postscript Jan 5: Have to admit I’ve found it hard to stop exploring this method. I ran it on a fiction dataset expanded to 4,000 works, and to 1922, and patterns started to become a little more legible. For instance, when I include more of her works, George Eliot no longer looks as idiosyncratic. It’s also kind of interesting to superimpose plot arcs for three different periods. Here I’ve borrowed Ben’s idea of using PCA so to speak “out-of-sample,” since each of these periods is actually projected into a different space (defined by the other two periods).

Generalized narrative arcs for 4,000 works of fiction from 1700 to 1922. Very few of them are actually before 1800, though.

Generalized narrative arcs for 4,000 works of fiction from 1700 to 1922. In each case we’re plotting ten composite points representing the topic distributions for segments of narrative time, and time moves from left to right. The dataset does include reprints.

The fact that these arcs float upward may confirm something we already knew, which is that fiction tends to move away from “summary” and toward direct presentation of “scene” as historical time passes. But I think the stability of the pattern is also significant. As Ben has shown, there’s no guarantee that you’ll get an arc if you project a dataset into a PCA space defined by a different dataset. The congruence of these three arcs may not quite prove that plot *is* an arc, but it does suggest that linguistic signals of “beginnings,” “middles,” and “ends” remained broadly similar from the early nineteenth century through the early twentieth. If we wanted to confirm that, we could make more direct comparisons, but for exploratory visualization I see how PCA is useful here.