
Can language models predict the next twist in a story?

By tedunderwood, January 5, 2024

While distant reading has taught us a lot about the history of fiction, it hasn’t done much yet to explain why we keep turning pages.

“Suspense” is the word we use to explain that impulse. But what is suspense? Does it require actual anxiety, or just uncertainty about what happens next? If suspense depends on not knowing what will happen, how can we enjoy re-reading familiar books? (Smuts 2009) And why do we enjoy being surprised? (Tobin 2018)

Beyond these big theoretical puzzles, there are historical questions scholars might like to ask about the way authors use chapter breaks to structure narrative revelation (Dames 2023, 219-38).

Right now, distant reading can’t fully answer any of these questions. When we want to measure surprise or novelty, for instance, we typically measure change in the verbal texture of a story from beginning to end. I made a coarse attempt of that kind in a blog post a few years ago. Other articles use better methods, and give us new ways to think about form (McGrath et al. 2018, Piper et al. 2023). But how closely does the pace of verbal change correlate with readers’ experience of uncertainty or surprise? We don’t know.

Autoregressive language models offer a tempting new angle on this problem, because they’re trained specifically to predict how a given text will continue. Intuitively, it feels like we could measure the predictability of a plot by first asking a model to continue the story, and then measuring the divergence between predicted continuation and real text. Even if this isn’t exactly how readers form expectations and experience surprise, it might begin to give us some leverage on the question.

Researchers have run a loosely similar experiment on very short stories contributed by experimental subjects (Sap et al. 2020). But scaling that up to published novels poses a challenge. For one thing, language models may not be equally good at imitating every style. A contemporary model’s failure to predict the next sentence by Jane Austen might just mean that it’s bad at channeling the Regency.

So, to factor style out of the question, let’s ask a model to predict what will happen in, say, the next three pages of a story — and then compare those predictions to its own summaries of the pages when it sees them.

Readers of a certain age will recognize this as a game Ernie invites Bert to play on “Sesame Street.”

Ernie asks Bert “what happens next” in this picture. Bert anticipates that the man will step in the pail, and disaster will ensue.

To spell the method out more precisely: we move through a novel roughly 900 words at a time. On each pass, we give a language model both a recap of earlier events, and a new 900-word passage. We ask the model to summarize the new passage, and also ask it to predict what will happen next. Then we compare its prediction to the summary it generates when it actually sees the next passage, and measure cosine distance between the two sentence embeddings. A large distance means the model did a poor job of predicting “what would happen next.”
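The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for this post: `summarize`, `predict`, and `embed` are hypothetical callables standing in for the language-model and sentence-embedding calls, and the way the recap is accumulated is a guess.

```python
import math

def chunk_words(text, size=900):
    """Split a text into consecutive passages of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, larger for farther apart."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def prediction_distances(chunks, summarize, predict, embed):
    """Compare each prediction with the summary of the passage it tried to foresee.

    summarize(recap, passage) -> str, predict(recap, passage) -> str and
    embed(sentence) -> list of floats are hypothetical stand-ins for model calls.
    """
    distances = []
    recap = ""
    pending_prediction = None
    for passage in chunks:
        summary = summarize(recap, passage)
        if pending_prediction is not None:
            # Large distance: the model did a poor job of guessing what came next.
            distances.append(
                cosine_distance(embed(pending_prediction), embed(summary))
            )
        pending_prediction = predict(recap, passage)
        recap = (recap + " " + summary).strip()  # assumed recap format
    return distances
```

A novel split into N passages yields N - 1 distances, one for each passage after the first.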

Does this have any relation to human uncertainty?

I’m not claiming that this is a good model of the way readers experience plot. We don’t have a good model of that yet! The more appropriate question to ask is: Does this correlate at all with anything human readers do?

We can check by asking a reader to do the same thing: read roughly 900-word passages and make predictions about the future. Then we can compare the human reader’s predictions to automatically-generated summaries.

Passages were drawn from *Now in November*, by Josephine Johnson, and *Murder is Dangerous*, by Saul Levinson. n = 51 passages, Pearson’s r = .41, p < .01. Human predictions are more variable in quality than the model’s.

When I did this for two novels that were complete blanks to me, my predictions tended to diverge from the actual course of the story in roughly the same places where the model found prediction difficult. So there does seem to be some relationship between a language model’s (in)ability to see what’s coming and a human reader’s.
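For readers who want to see what the correlation reported in the caption measures, here is a minimal sketch of Pearson's r. The two lists are made-up numbers for five passages, included only to show the shape of the comparison; they are not data from this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical prediction-to-summary distances, one pair per passage.
model_distance = [0.20, 0.35, 0.30, 0.55, 0.40]
human_distance = [0.25, 0.30, 0.45, 0.60, 0.35]
r = pearson_r(model_distance, human_distance)
```

A positive r means the passages that were hard for the model also tended to be hard for the human reader.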

The image above also reveals that there are broad, consistent differences between books. For both people and models, some stories are easier to predict than others.

A reason not to trust this

Readers of this story may already anticipate the next twist—which is that of course we shouldn’t use LLMs to study uncertainty, because these models have already read many of the books we’re studying and will (presumably) already know the plot.

This is a particularly nasty problem because we don’t have a list of the books commercial models were trained on. We’re flying blind. But before we give up, let’s test how much of a problem this really poses. Researchers at Berkeley have defined a convenient test of the extent to which a model has memorized a book (Chang et al. 2023). In essence, they ask the model to fill in missing names.

Running this test, Chang et al. find that GPT-4 remembers many books in detail. Moreover, its ability to fill in masked names correlates with its accuracy on certain other tasks—like its ability to estimate date of publication. This could be a problem for questions about plot.

To avoid this problem (and also save money), I’ve been using GPT-3.5, which Chang et al. find is less prone to memorize books. But is that enough to address the problem? Let’s check. Below I’ve plotted the average divergence between prediction and summary for 25 novels on the y axis, and GPT-3.5’s ability to supply masked names in those texts on the x axis. If memorization were making prediction more accurate, we would expect to see a negative correlation: predictions’ divergence from summaries should go down as name_cloze accuracy goes up.

The y axis is average cosine distance between prediction and summary; x axis is GPT-3.5’s accuracy on the name cloze test defined in Chang et al.

25 books is not enough for a conclusive answer, but so far I cannot measure any pattern of that kind. (If anything, there is a faint trend in the opposite direction.)

In an ideal world, researchers would use language models trained on open data sets that they know and control. But until we get to an ideal world, it looks like it may be possible to run proof-of-concept experiments with things like GPT-3.5, at least if we avoid extremely famous books.

Scrutinizing the image above, readers will probably notice that the most predictable book in this sample was Zoya, by Danielle Steel. Although Steel has a reputation that may encourage disparaging inferences — see Dan Sinykin, Big Fiction, for why — I don’t think we’re in a position to draw those inferences yet. The local rhythms that make prediction possible across three pages are not necessarily what critics mean when they use “predictable” to diss a book.

So what could we learn from predicting the next three pages?

To consider one possible payoff: it might give us a handle on the way chapter-breaks, and other divides, structure the epistemic rhythms of fiction. For instance, many readers have noticed that the installments of novels originally published in magazines tend to end with an explicit mystery to ensure that you keep reading (Haugtvedt 2016, Beekman 2017). In the first installment of Arthur Conan Doyle’s Hound of the Baskervilles (which covers two chapters), Watson and Holmes learn about a legendary curse that connects the family of the Baskervilles to a fiendish hound. In the final lines of the first installment, Holmes asks the family doctor about footprints found near the body of Sir Charles Baskerville. “A man’s or a woman’s?” Holmes asks. The doctor’s “voice sank almost to a whisper as he answered. ‘Mr Holmes, they were the footprints of a gigantic hound!’” End installment.

Novels don’t have soundtracks. But unexplained, suggestive new information is as good as a sting: “Bum – bum – BUM!”

The Hound of the Baskervilles, by Sidney Paget, 1902.

It appears that we can measure this cliffhanger effect: the serial installments of The Hound of the Baskervilles often end with a moment of heightened mystery — at least, if inability to predict the next three pages is any measure of mystery. When we measure predictive accuracy throughout the story, making four different passes to ensure we have roughly-900-word chunks aligned with all the chapter breaks, we find that predictions are farther from reality at the ends of serial installments. There is no similar effect at other chapter breaks.

The mean for breaks at serial installments is more than one standard deviation above the mean for other chapter-breaks. Despite the tiny n, the difference is significant at p < .05.
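One way to run this kind of comparison on a small sample is sketched below. Two things here are assumptions: the gap is scaled by the standard deviation of the other chapter breaks, and significance is checked with a permutation test, which may not be the test behind the figure. The distances are made-up numbers, not measurements from The Hound of the Baskervilles.

```python
import random
import statistics

def standardized_gap(installment_ends, other_breaks):
    """Difference in means, in units of the other breaks' standard deviation."""
    gap = statistics.mean(installment_ends) - statistics.mean(other_breaks)
    return gap / statistics.stdev(other_breaks)

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """One-sided permutation test on the difference in means.

    Counts how often a random relabelling of the pooled values gives a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if statistics.mean(shuffled_a) - statistics.mean(shuffled_b) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Hypothetical prediction-to-summary distances at two kinds of chapter break.
installment_ends = [0.62, 0.58, 0.66, 0.60]
other_breaks = [0.45, 0.50, 0.48, 0.52, 0.47, 0.44]

gap = standardized_gap(installment_ends, other_breaks)
p_value = permutation_p_value(installment_ends, other_breaks)
```

A permutation test makes no assumption about the shape of the distribution, which is a reasonable choice when there are only a handful of installments per novel.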

Now, this is admittedly a cherry-picked example. So far, I have only looked at seven novels where we can distinguish the ends of serial installments from other kinds of chapter break (using data from Warhol et al.). And I don’t see this pattern in all of them.

So I’m not yet making any historical argument about serialization and the rise of the cliffhanger. I’m just suggesting that it’s the kind of question someone could eventually address using this method. A doctoral student could do it, for instance, with a locally hosted model. (I don’t recommend doing it with GPT-3.5, because I dropped $150 or so on this post, and that might add up across a dissertation.) Some initial tests suggest to me that this approach will produce results significantly different from those we’re getting with lexical methods.

Since I’m explicitly encouraging people to run with this, let me also say that someone actually writing a paper using this method might want to tinker with several things before trusting it:

- Measuring the distance between the embedding of one prediction sentence and one summary sentence is a crude way to measure expectation and surprise. Readers don’t necessarily form a single expectation about plot. Maybe it would be better to model expectation as a range of possibility?
- Related to this: models may need to be nudged to speculate and not just predict that current actions will continue.
- 900-word chunks may not be the only appropriate scale of analysis. When readers talk about narrative surprise they’re often thinking about larger arcs like “who will he marry?” or “who turns out to be the murderer?”
- We need a way to handle braided narratives where each chapter is devoted to a different group of characters (Garrett 1980). In a multi-plot story, the B or C plot will often not continue across a chapter break.

But we’re in a multi-plot narrative ourselves, so those problems may be solved by a different group of characters. This was just a blog post to share an idea and get people arguing about it. Tune in next time, for our thrilling conclusion. (Bum – bum – BUM!)

Code and data used for this post are available on GitHub.

The ideas discussed here were previously presented in Paris at a workshop on AI for the analysis of literary corpora, and in Copenhagen in a conference on generative methods in the social sciences and humanities. I’d like to thank the organizers of those events, esp. Thierry Poibeau, Anders Munk, and Rolf Lund, for stimulating conversation — and also many people in attendance, especially David Bamman, Lynn Cherny, and Meredith Martin. In writing code to query the OpenAI API, I borrowed snippets from Quinn Dombrowski (and also of course from GPT-4 itself oh brave new world &c). My thinking about 19c serialization was advanced by conversation with David Bishop and Eleanor Courtemanche, and by suggestions from Ryan Cordell and Elizabeth Foxwell on Bluesky.

References

Beekman, G. (2017) “Emotional Density, Suspense, and the Serialization of The Woman in White in All the Year Round.” Victorian Periodicals Review 50.1.

Chang, K., Cramer, M., Soni, S., & Bamman, D. (2023) “Speak, Memory: An Archaeology of Books known to ChatGPT / GPT-4,” https://arxiv.org/abs/2305.00118.

Dames, N. (2023) The Chapter: A Segmented History from Antiquity to the Twenty-First Century. Princeton University Press.

Garrett, P. (1980) The Victorian Multiplot Novel: Studies in Dialogical Form. Yale University Press.

Haugtvedt, E. (2016) “The Sympathy of Suspense: Gaskell and Braddon’s Slow and Fast Sensation Fiction in Family Magazines.” Victorian Periodicals Review 49.1.

McGrath, L., Higgins, D., & Hintze, A. (2018) “Measuring Modernist Novelty.” Journal of Cultural Analytics. https://culturalanalytics.org/article/11030-measuring-modernist-novelty

Piper, A., Xu, H., & Kolaczyk, E. D. (2023) “Modeling Narrative Revelation.” Computational Humanities Research 2023. https://ceur-ws.org/Vol-3558/paper6166.pdf

Sap, M., Horvitz, E., Choi, Y., Smith, N. A., & Pennebaker, J. (2020) “Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models.” Proceedings of the 58th Annual Meeting of the ACL. https://aclanthology.org/2020.acl-main.178/

Sinykin, D. (2023) Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. Columbia University Press.

Smuts, A. (2009) “The Paradox of Suspense.” Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/paradox-suspense/#SusSur

Tobin, V. (2018) The Elements of Surprise: Our Mental Limits and the Satisfactions of Plot. Harvard University Press.

Warhol, R., et al. “Reading Like a Victorian.” https://readinglikeavictorian.osu.edu/.


How predictable is fiction?

By tedunderwood, July 5, 2020

This blog post is loosely connected to a talk I’m giving (virtually) at the Workshop on Narrative Understanding, Storylines, and Events at the ACL. It’s an informal talk, exploring some of the challenges and opportunities we encounter when we take the impressive sentence-level tools of contemporary NLP and try to use them to produce insights about book-length documents.

Questions about the “predictability” of fiction started to interest me after I read a preprint by Maarten Sap et al. on the difference between “recollected” and “imagined” stories. There’s a lot in the paper, but the thing that especially caught my eye was that a neural language model (GPT) does better predicting the next sentence in imagined stories than in recollected stories about biographical events. The authors persuasively interpret this as a sign that imagined stories have been streamlined by a process of “narrativization.”

The stories in that article are very short narratives made up (or recalled) by experimental subjects. But, given my background in literary history, I wondered whether the same contrast might appear between book-length works of fiction and biography. Are fictional narratives in some sense more predictable than nonfiction?

One could say we already know the answer. Fiction is governed by plot conventions, so of course it makes sense that it’s predictable! But an equally intuitive argument could be made that fiction entertains readers by baffling and eluding their expectations about what, specifically, will happen next. Perhaps it ought to be less predictable than nonfiction? In short, there are basic questions about fiction that don’t have clear general answers yet, although we’re getting better at framing the questions. (See e.g. Caroline Levine on The Serious Pleasures of Suspense, Vera Tobin on Elements of Surprise, or Andrew Piper’s chapters on “Plot” and “Fictionality” in Enumerations.)

Plus, even if it were intuitively obvious that fiction is more strongly governed by plot conventions than by surprise, it might be interesting to measure the strength of those conventions in particular works. If we could do that, we’d have new evidence for a host of familiar debates about tradition and innovation.

So, how to do it? Sap et al. measure “narrative flow” by using a neural language model that can judge whether a sentence is likely to occur in a given context. It’s a good strategy for paragraph or page-sized stories, but I suspect sentences may be too small to capture the things we would call “predictable plot patterns” in novels. However, it wasn’t hard to give this strategy a spin, so I did, using a language model called BERT to assess pairs of sentences from 32 biographies and 32 novels. (This is just a toy-sized sample for a semi-thought-experiment; I’m not pretending to finally resolve anything.) At each step, in each book, I asked BERT to judge the probability that sentence B would really follow sentence A. (The code I used is in a GitHub repo.)

The result I got was the opposite of the one reported in Sap et al. There is a statistically significant difference between biography and fiction, but the pairs of sentences in biography appeared more predictable—more likely to follow each other—than the sentences in fiction. I hasten to say, however, that this could be wrong in several ways. First, BERT’s perception that two sentences are likely to follow each other correlates strongly with the length of the sentences. Short sentences (like most sentences in dialogue) seem less clearly connected. Since there’s a lot of dialogue in published fiction, BERT might be, in effect, biased against fiction.

Fig. 1. Two different ways of measuring continuity between some sample sentences.

More importantly, sentence-level continuity isn’t necessarily a good measure of surprise in novel-length works. For instance, in fig. 1, you’ll notice that BERT is unruffled when Pride and Prejudice morphs into Flatland. As long as each sentence picks up some discursive cue from the one before, BERT perceives the pairs as plausibly connected. But by the fourth sentence in the chain, Mr Bennet is listening to a lecture from a translucent, blue, four-dimensional being in his sitting room. Human readers would probably be surprised if this happened.

There are ways to generate “sentence embeddings” that might correspond more closely to human surprise. (This is a crowded field, but see for instance Sentence-BERT, Reimers and Gurevych 2019.) Even primitive 2014-era GloVe embeddings do a somewhat better job (Pennington, Socher, and Manning 2014). By averaging the GloVe embeddings for all the words in a sentence, we can represent each sentence as a vector of length 300. Then we can measure the cosine distances between sentences, as I’ve done in the third column of Fig 1. (Here, large numbers indicate a big gap between sentences; it’s the reverse of the “probability” measure provided by BERT, where high numbers represent continuity.) This model of distance is (appropriately) more surprised by the humming blue sphere in row three than by the short sentence of dialogue in row five.
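The averaging step is simple enough to show directly. The sketch below uses a tiny made-up table of 3-dimensional vectors in place of real 300-dimensional GloVe embeddings, and a naive whitespace tokenizer; both are assumptions for illustration, as are the two example sentences.

```python
import math

# Made-up 3-dimensional "embeddings"; real GloVe vectors have 300 dimensions.
TOY_VECTORS = {
    "the": [0.1, 0.0, 0.1],
    "gentleman": [0.9, 0.1, 0.0],
    "spoke": [0.7, 0.2, 0.1],
    "sphere": [0.0, 0.9, 0.3],
    "hummed": [0.1, 0.8, 0.4],
}

def sentence_vector(sentence, vectors):
    """Average the vectors of the known words in a sentence."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in sentence")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]

def cosine_distance(u, v):
    """1 - cosine similarity: large values mean a big gap between sentences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

a = sentence_vector("The gentleman spoke", TOY_VECTORS)
b = sentence_vector("The sphere hummed", TOY_VECTORS)
gap = cosine_distance(a, b)
```

Because every word contributes equally to the average, frequent function words can dominate; weighting schemes such as the one in Arora et al. (listed in the references) are one response to that.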

But even if we had a good measure of continuity, sentences might just be too small to capture the patterns that count as “predictability” in a novel. As the example in fig. 1 suggests, a sequence of short steps, individually unsurprising, can leave the reader in a world very different from the place they started. Continuity of this kind is not the “predictability” we would want to measure at book scale.

When readers talk about predictable or unpredictable stories, they’re probably thinking about specific problem situations and possible outcomes. Will the protagonist marry suitor A or suitor B? Can we guess? It may soon be possible to automatically extract implicit questions of this kind from fiction. And the Story Cloze task (Mostafazadeh et al.) showed that it’s possible to answer “what happens next” at paragraph scale. But right now I don’t know how to extract implicit questions, or answer them, at the scale of a novel. So let’s try a simpler (in fact minimal) predictive task. Given two passages selected at random from a book, can we predict which came first? Doing that won’t tell us anything about plot, if “plot” is a causal connection between events. But it will tell us whether book-length works are organized by any predictable large-scale patterns. (As we’ll see in a moment, this is a real question, and in some genres the answer might be “not really.”)

The vector-space representation we developed in the third column of Fig. 1 can be scaled up for this question. “Paragraphs” and “chapters” mean different things in different periods, so for now, it may be better simply to divide stories into arbitrary thousand-word passages. Each passage will be represented as a vector by averaging the GloVe embeddings for the words in it; we’ll subtract one passage from the other and use the difference to decide whether A came before B in the book, or vice-versa.
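To make the setup concrete, here is a toy version of the ordering task. Everything about the data is invented: each "book" is a list of 2-dimensional passage vectors in which one dimension drifts downward across narrative time and the other is noise. The classifier is a small L2-regularized logistic regression fit by gradient descent; the learning rate, penalty, and number of epochs are arbitrary choices, not the settings used for the figures in this post.

```python
import math
import random

def difference(vec_a, vec_b):
    """Feature vector for a pair of passages: A minus B."""
    return [a - b for a, b in zip(vec_a, vec_b)]

def train_logistic(features, labels, l2=0.01, learning_rate=0.5, epochs=200):
    """L2-regularized logistic regression fit by batch gradient descent."""
    dims = len(features[0])
    weights, bias, n = [0.0] * dims, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dims, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dims):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        weights = [w - learning_rate * (g / n + l2 * w)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def toy_book(rng, n_passages=20):
    """Invented passage vectors: dimension 0 declines over the book, dimension 1 is noise."""
    return [[1.0 - i / n_passages + rng.gauss(0, 0.2), rng.gauss(0, 0.2)]
            for i in range(n_passages)]

def sample_pairs(book, rng, n_pairs=200):
    """Random pairs of passages; label 1 if the first of the pair comes earlier."""
    features, labels = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(book)), 2)
        features.append(difference(book[i], book[j]))
        labels.append(1 if i < j else 0)
    return features, labels

rng = random.Random(0)
train_x, train_y = [], []
for _ in range(5):                      # train on five toy books
    x, y = sample_pairs(toy_book(rng), rng)
    train_x += x
    train_y += y
weights, bias = train_logistic(train_x, train_y)

test_x, test_y = sample_pairs(toy_book(rng), rng)   # one held-out toy book
guesses = [1 if bias + sum(w * d for w, d in zip(weights, x)) > 0 else 0
           for x in test_x]
accuracy = sum(g == y for g, y in zip(guesses, test_y)) / len(test_y)
```

Holding out a whole book, rather than random pairs, keeps the model from simply memorizing the quirks of a volume it has already seen.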

Fig. 2. Accuracy of sequence prediction for randomly selected pairs of passages from detective novels, or novels randomly selected from the whole Chicago Novel Corpus. Regularized logistic regression is trained on 47 volumes and tested on the 48th; the boxplots represent the range of mean accuracies for different held-out volumes.

Random accuracy for this task would be 50%, but a model trained on a reasonable number of novels can easily achieve 65-66%, especially if the novels are all in the same genre. That number may not sound impressive, but I suspect it’s not much worse than human accuracy would be—if a human reader were asked to draw the arrow of time connecting two random passages from an unfamiliar book.

In fact, why is it possible to do this at all? Since the two passages may be separated by a hundred-odd pages, our model clearly isn’t registering any logical relationships between events. Instead, it’s probably relying on patterns described in previous work by David McClure and Scott Enderle. McClure and Enderle have shown that there are strong linguistic gradients across narrative time in fiction. References to witnesses, guilt, and jail, for instance, tend to occur toward the end of a book (if they occur at all).

Fig. 3. David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Our model may draw even stronger clues from simple shifts of rhetorical perspective like the one in figure 3: indefinite articles appear early in a book, when “a mysterious old man” enters “a room.” A few pages later, he will either acquire a name or become “the old man” in “the room.”

Fig. 4. David William McClure and Scott Enderle, “Distribution of Function Words Across Narrative Time in 50,000 Novels,” ADHO 2018.

We probably wouldn’t call that shift of perspective “plot.” On the other hand, before we dismiss these gradients as merely linguistic rather than narrative phenomena, it’s worth noting that they seem to be specific to fiction. When I try to use the same general strategy to predict the direction of time between pairs of passages in biographies, the model struggles to do better than random guessing. Even with the small toy sample I’m using below (32 novels and 32 biographies), there is clearly a significant difference between the two genres. So, although BERT may not see it, fictional narratives are more predictable than nonfiction ones when we back out to look at the gradient of time across a whole book. There is a much clearer difference between before and after in fiction.

Fig. 5. Range of accuracies for a regularized logistic regression model trained to identify the earlier of two 1000-word passages.

“A predictable difference between before and after” is something a good bit cruder than we ordinarily mean by “plot.” But the fact that this difference is specific to fiction makes me think that a model of this kind may after all confirm some part of what we meant in speculating “fictional plots are shaped by conventions that make them more predictable than nonfiction.”

Of course, to really understand plot, we will need to pair these loose book-sized arcs with a more detailed understanding of the way characters’ actions are connected as we move from one page to the next. For that kind of work, I invite you to survey the actual papers accepted for the Workshop on Narrative Understanding <gestures at the program>, which are advancing the state of the art, for instance, on event extraction.

But I can’t resist pointing out that even the crude vector-space model I have played with here can give us some leverage on page-level surprise, and in doing so, complicate the story I’ve just told. One odd detail I’ve noticed is that the predictability of a narrative at book scale (measured as our ability to predict the direction of time between two widely separated passages) correlates with a kind of unpredictability as we move from one sentence, page, or thousand-word passage to the next.

For instance, one way to describe the stability of a sequence is to measure “autocorrelation.” If we shift a time series relative to itself, moving it back by one step, how much does the original series correlate with the lagged version?

Fig. 6. These are wholly imaginary curves to illustrate an idea.

A process with a lot of inertia (e.g., change in temperature across a year) might still have the same basic shape if we shift it backward eight hours. The amount of sunlight in Seattle, on the other hand, fluctuates daily and will be largely out of phase with itself if we shift it backward eight hours; the correlation between those two curves will be pretty low, or even negative as above.

Since we’re representing each passage of a book as a vector of 300 numbers, this gives us 300 time series—300 curves—for each volume. It is difficult to say what each curve represents; the individual components of a word embedding don’t come with interpretable labels. But we can measure the narrative’s general degree of inertia by asking how strongly these curves are, collectively, autocorrelated. Crudely: I shift each time series back one step (1000 words) and measure the Pearson correlation coefficient between the lagged and unlagged version. Then I take the mean correlation for all 300 series.*
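In code, the crude measure just described might look like the sketch below. The two example "books" are invented to show the extremes (a slow drift and a strict alternation), and 2-dimensional vectors stand in for the 300-dimensional ones.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (undefined if either series is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lag1_autocorrelation(series):
    """Correlate a series with itself shifted by one step."""
    return pearson_r(series[:-1], series[1:])

def mean_autocorrelation(passage_vectors):
    """Mean lag-1 autocorrelation across all dimensions.

    passage_vectors holds one vector per passage, in narrative order.
    """
    dims = len(passage_vectors[0])
    per_dimension = []
    for d in range(dims):
        series = [vec[d] for vec in passage_vectors]
        per_dimension.append(lag1_autocorrelation(series))
    return sum(per_dimension) / dims

# Invented examples: a slowly drifting book and a sharply alternating one.
slow_book = [[i / 10, (i / 10) ** 2] for i in range(10)]
volatile_book = [[(-1) ** i, i % 2] for i in range(10)]

slow_score = mean_autocorrelation(slow_book)
volatile_score = mean_autocorrelation(volatile_book)
```

High scores indicate inertia (the temperature-like case); low or negative scores indicate volatility from one passage to the next.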

Fig. 7. Relationship between the volatility of the text (low autocorrelation) and accuracy of models that attempt to put two passages in the right order. Although there are more fiction volumes, we keep accuracy comparable by training on only 32 volumes at a time.

The result is unintuitive. You might think it would be easier to predict the direction of narrative time in books where variables change slowly—as temperature does—tracing a reliable arc. But instead it turns out that prediction is more accurate in books where these curves behave a bit like sunlight, fluctuating substantially every 1000 words. (The linear relationship with autocorrelation is r = -.237 in fig 7, though I suspect the real relationship isn’t linear.) Also, biography appears to be distinguished from fiction by higher autocorrelation (lower volatility).

So yes, fiction is more predictable than nonfiction across the sweep of a whole narrative (because the beginnings and ends of novels are rhetorically very distinct). But the same observation doesn’t necessarily hold as we move from page to page, or sentence to sentence. At that scale, fiction may be more volatile than nonfiction is. I don’t yet know why! We could speculate that this has something to do with an imperative to surprise the reader—but it might also be as simple as the alternation of dialogue and description, which creates a lot of rapid change in the verbal texture of fiction. In short, I’m pointing to a question rather than answering one. There appear to be several different kinds of “predictability” in narrative, and teasing them apart might give us some simple insights into the structural differences between fiction and nonfiction.

Postscript: Everything above is speculative and exploratory. I’ve shared some code and data in a repository, but I wouldn’t call it fully replicable. There are more sophisticated ways to measure autocorrelation. If any economists read this, it will occur to them that we could also “predict the future course of a story” using full vector autoregression or an ARIMA model. I’ve tried that, but my sense is that the results were actually dominated by the two factors explored separately above (before-and-after predictability and the autocorrelation of individual variables with themselves). Also, to make any of this really illuminate literary history, we will need a bigger and better corpus, allowing us to ask how patterns like this intersect with genre, prestige, and historical change. A group of researchers at Illinois, including Wenyi Shang and Peizhen Wu, are currently pursuing those questions.

References:

Edwin A. Abbott, Flatland: A Romance of Many Dimensions (London: 1884).

Sanjeev Arora, Yingyu Liang, Tengyu Ma, “A Simple but Tough-to-Beat Baseline for Sentence Embeddings.” ICLR 2017.

Jane Austen, Pride and Prejudice (London: Egerton, 1813).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

Caroline Levine, The Serious Pleasures of Suspense (Charlottesville: University of Virginia Press, 2003).

David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, James Allen. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. NAACL 2016.

Shay Palachy, “Document Embedding Techniques: A Review of Notable Literature on the Topic,” Towards Data Science, September 9, 2019.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014.

Andrew Piper, Enumerations (Chicago: University of Chicago Press, 2018).

Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” EMNLP-IJCNLP 2019.

Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, James Pennebaker. Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models. ACL 2020.

Vera Tobin, Elements of Surprise: Our Mental Limits and the Satisfactions of Plot (Cambridge: Harvard University Press, 2018).
