
A more interesting upside of AI

My friends are generally optimistic, forward-looking people, but talking about AI makes many of them depressed. Either AI is scarier than other technologies, or public conversation about it has failed them in some way. Or both.

I think the problem is not just that people have legitimate concerns. What’s weird and depressing about AI discourse right now is the surprising void where we might expect to find a counterbalancing positive vision.

It is not unusual, after all, for new technologies to have a downside. Airplanes were immediately recognized as weapons of war, and we eventually recognized that the CO2 they produce is not great either. But their upside is also vivid: a new and dizzying freedom of motion through three-dimensional space. Is the upside worth the cost? Idk. Saturation bombing has been bad for civilians. But there is at least something in both pans of the scale. It can swing back and forth. So when we think about flight—even if we believe it has been destructive on balance—we can see tension and a possibility of change. We don’t just feel passively depressed.

Is “super-intelligence” the upside for AI?

What people seem to want to put on the “positive” side of the balance for AI is a 1930s-era dystopia skewered well by Helen De Cruz.

The upside of AI, apparently, is that super-intelligence replaces human agency. This is supposed to have good consequences: accelerating science and so on. But it’s not exactly motivating, because it’s not clear what we get to do in this picture.

Sam Altman’s latest blog post (“The Gentle Singularity”) reassures us by telling us we will always feel there’s something to do. “I hope we will look at the jobs a thousand years in the future and think they are very fake jobs, and I have no doubt they will feel incredibly important and satisfying to the people doing them.”

If this is the upside on offer, I’m not surprised people are bored and depressed. First, it’s an unappealing story, as Ryan Moulton explains:

Second, as Ryan hints in his last sentence, Altman’s vision of the future isn’t very persuasive. Stories about fully automated societies where superintelligent AI makes the strategic decisions, coordinates the supply chains, &c, quietly assume that we can solve “alignment” not only for models but for human beings. Bipedal primates are expected to naturally converge on a system that allows decisions to be made by whatever agency is smartest or produces the best results. Some version of this future has sounded rational and plausible to nerds since Plato. But somehow we nerds consistently fail to make it reality—in spite of our presumably impressive intelligence. If you want a vision of the future of “super-intelligence,” consider the fate of the open web or the NSF.

I’m not just giving a fatalistic shrug about politics and markets here. I think cutting the NSF was a bad idea, but there are good reasons why we keep failing to eliminate human disagreement. It’s a load-bearing part of the system. If you or I tried to automate, say, NSF review panels, our automated system would develop blind spots, and would eventually need external criticism. (For a good explanation of why, see Bryan Wilder’s blog post on automated review.) Conceivably that external criticism could be provided by AI. But if so, the AI would need to be built or controlled by someone else—someone independent of us. If a task is really important, you need legal persons who can’t edit or delete each other arguing over it.

AI as a cultural technology

The irreducible centrality of human conflict is one reason why I doubt that “super-intelligence” is the right frame for thinking about economic and social effects of AI. However smart it gets, a system that lacks independent legal personhood is not a good substitute for a human reviewer or manager. Nor do I think it’s likely that fractious human polities will redefine legal personhood so it can be multiplied by hitting command-C followed by command-V.

A more coherent positive picture of a future with AI has started to emerge. As the title of Ethan Mollick’s Co-Intelligence implies, it tends to involve working with AI assistance, not consigning large sectors of the economy to a super-intelligence. I’ve outlined one reason to expect that path above. Arvind Narayanan and Sayash Kapoor have provided a more sustained argument that AI capability is unlikely to exponentially exceed human capability across a wide range of tasks.

One reason they don’t expect that trajectory is that the recent history of AI has not tended to support assumptions about the power of abstract and perfectly generalizable intelligence. Progress was rapid over the last ten years—but not because we first discovered the protean core of intelligence itself, which then made all merely specific skills possible. Instead, models started with vast, diverse corpora of images and texts, and learned how to imitate them. This approach was frustrating enough to some researchers that they dismissed the new models as mere “parrots,” entities that fraudulently feign intelligence by learning a huge repertoire of specific templates.

A somewhat more positive response to this breakthrough, embraced by Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans, has been to characterize AI as a “cultural technology.” Cultural technologies work by transmitting and enacting patterns of behavior. Other examples might include libraries, the printing press, or language itself.

Is this new cultural technology just a false start or a consolation prize in a hypothetical race whose real goal is still the protean core of intelligence? Many AI researchers seem to think so; “AGI” is the name they give that goal. Some, like Yann LeCun, argue that getting there will require a radically different approach. Others suspect that transformer models or diffusion models can do it with sufficient scale.

I don’t know who’s right. But I also don’t care very much. I’m not certain I believe in absolutely general intelligence—and I know that I don’t believe culture is a less valuable substitute for it.

On the contrary. I’m fond of Christopher Manning’s observation that, where sheer intelligence is concerned, human beings are not orders of magnitude different from bonobos. What gave us orders of magnitude greater power to transform this planet, for good or ill, was language. Language vastly magnified our ability to coordinate patterns of collective behavior (culture), and transmit those patterns to our descendants. Writing made cultural patterns even more durable. Now generative language models (and image and sound models) represent another step change in our ability to externalize and manipulate culture.

Why “cultural technology” doesn’t make anyone less depressed

I’ve suggested that a realistic, potentially positive vision of AI has started to coalesce. It involves working with AI as a “normal technology” (Narayanan and Kapoor), one in a long sequence of “cultural technologies” (Gopnik, Farrell, et al.) that have extended the collective power of human beings.

So why are my friends still depressed?

Well, if they think the negative consequences of AI will outweigh the positive ones, they have every right to be depressed, because no one has proven that’s wrong. It is absolutely still possible that AI will displace people from existing jobs, force retraining, increase concentration of power, and (further) destabilize democracy. I don’t think anyone can prove that the upside of AI outweighs those possible downsides. The cultural and political consequences of a technology are hard to foresee at this stage.

French aviator Louis Paulhan flying over Los Angeles in 1910. Library of Congress.

But as I hinted at the beginning of this post, I’m not trying to determine whether AI is good or bad on balance. It can be hard to reach consensus about that, even with a technology as mature as internal combustion or flight. And, even with very mature technologies, it tends to be more useful to try to change the balance of effects than to debate whether it’s currently positive.

So the goal of this post is not to weigh costs and benefits, or argue with skeptics. It is purely to sharpen our sense of the potential upside latent in a vision of AI as “cultural technology.” I think one reason that phrase hasn’t cheered anyone up is that it has been perceived as a deflating move, not an inspiring one. The people disposed to be interested in AI mostly got hooked by a rather different story, about protean general intelligences closely analogous to human beings. If you tell AI enthusiasts “no, this is more like writing,” they tend to get as depressed as the skeptics. My goal here is to convince people who use AI that “this is like writing” is potentially exciting—and to specify some of the things we would need to do to make it exciting.

Mapping and editing culture

So what’s great about writing? It is more durable than the spoken word, of course. But just as importantly, writing allows us to take a step back from language, survey it, fine-tune it, and construct complex structures where one text argues with two others, each of which footnotes fifty others. It would be hard to imagine science without the ability writing provides to survey language from above and use it as building material.

Generative AI represents a second step change in our ability to map and edit culture. Now we can manipulate, not only specific texts and images, but the dispositions, tropes, genres, habits of thought, and patterns of interaction that create them. I don’t think we’ve fully grasped yet what this could mean.

I’ll try to sketch one set of possibilities. But I know in advance that I will fail here, just as someone in 1470 would have failed to envision the scariest and most interesting consequences of printing. At least maybe we’ll get to the point of seeing that it’s not mostly “cheaper Bibles.”

Here’s a Bluesky post that used generative AI to map the space of potential visual styles using a “style reference” (sref).

In the early days of text-to-image models, special phrases were passed around like magic words. Adding “Unreal Engine” or “HD” or “by James Gurney” produced specific stylistic effects. But the universe of possible styles is larger than a dictionary of media, artistic schools, or even artists’ names can begin to cover. If we had a way to map that universe, we could explore blank spaces on the map.

Midjourney invented “style references” as a simple way to do that. You create a reference by choosing between pairs of images that represent opposing vectors in a stylistic plane. In the process of making those choices, you construct your own high-dimensional vector. Once you have a code for the vector, you can use it as a newly invented adjective, and dial its effect up or down.
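To make the mechanism concrete, here is a toy sketch of how pairwise choices could add up to a style vector. This is my illustration, not Midjourney’s actual algorithm (which isn’t public); the image embeddings and the update rule are assumptions.

```python
import numpy as np

# Toy sketch: each pairwise choice nudges a style vector toward the image
# you picked and away from the one you rejected. (Hypothetical construction;
# Midjourney has not published how srefs are actually built.)
def build_style_vector(choices: list[tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """choices: (embedding of chosen image, embedding of rejected image) pairs."""
    v = sum(chosen - rejected for chosen, rejected in choices)
    return v / np.linalg.norm(v)  # a unit-length "adjective" you can scale up or down
```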

A map with just a touch of the style shared above by Susan Fallow, which adds (among other things) traces that look like gilding. (“an ancient parchment map showing the coastline of an imagined world --sref 321312992 --sv 4 --sw 20 --ar 2:1”)

“Style references” are modest things, of course. But if we can map a space of possibility for image models, and invent new adjectives to describe directions in that space, we should be able to do the same for language models.

And “style” is not the only thing we could map. In exploring language, it seems likely that we will be mapping different ways of thinking. The experiment called “Golden Gate Claude” was an early, crude demonstration of what this might mean. Anthropic mapped the effect of different neurons in a model, and used that knowledge to tune up one neuron and create a version of Claude deeply obsessed with the Golden Gate Bridge. Given any topic, it would eventually bring the conversation around to fog, or the movie Vertigo — which would remind it, in turn, of San Francisco’s iconic Golden Gate Bridge.
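Mechanically, “tuning up” a feature like this amounts to pushing the model’s internal activations along one direction. A minimal sketch of the idea, under assumptions: Anthropic identified its features with sparse autoencoders rather than raw neurons, and none of this is exposed as a public API, so the function below is purely illustrative.

```python
import torch

def amplify_feature(hidden: torch.Tensor, direction: torch.Tensor,
                    strength: float = 10.0) -> torch.Tensor:
    """Push every token's hidden state along one feature direction.

    hidden: (tokens, d_model) activations at some layer;
    direction: unit vector for the feature (e.g., "Golden Gate Bridge").
    """
    return hidden + strength * direction
```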

Golden Gate Claude was more of a mental illness than a practical tool. (It reminded me of Vertigo in more than one sense.) But making the model obsessed with a specific landmark was a deliberate simplification for dramatic effect. Anthropic’s map of their model had a lot more nuance available, if it had been needed.

From “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”

The image above is a map of conceptual space based on neurons in a single model. You could think of it as a map of a single mind, or (if you prefer) a single pattern of language use. But models also differ from each other, and there’s no reason why we need to be limited to considering one at a time.

Academics working in this space (both in computer science and in the social sciences) are increasingly interested in using language models to explore cultural pluralism. As we develop models trained on different genres, languages, or historical periods, these models could start to function as reference points in a larger space of cultural possibility that represents the differences between maps like the one above. It ought to be possible to compare different modes of thought, edit them, and create new adjectives (like style references) to describe directions in cultural space.

If we can map cultural space, could we also discover genuinely new cultural forms and new ways of thinking? I don’t know why not. When writing was first invented it was, obviously, a passive echo of speech. “It is like a picture, which can give no answer to a question, and has only a deceitful likeness of a living creature” (Phaedrus).

But externalizing language, and fixing it in written marks, eventually allowed us to construct new genres (the scientific paper, the novel, the index) that required more sustained attention or more mobility of reference than the spoken word could support. Models of culture should similarly allow us to explore a new space of human possibility by stabilizing points of reference within it.

Wait, is editing culture a good idea?

“A new ability to map and explore different modes of reasoning” may initially sound less useful than a super-intelligence that just produces the right answer to our problems. And, look, I’m not suggesting that mapping culture is the only upside of AI. Drug discovery sounds great too! But if you believe human conflict is our most important and durable problem, then a technology that could improve human self-understanding might eventually outweigh a million new drugs.

I don’t mean that language models will eliminate conflict. I said above that conflict is a load-bearing part of society. And language models are likely to be used as ideological weapons—just as pamphlets were, after printing made them possible. But an ideological weapon that can be quizzed and compared to other ideologies implies a level of putting-cards-on-the-table beyond what we often get in practice now. There is at least a chance, as we have seen with Grok, that people who try to lie with an interactive model will end up exposing their own dishonest habits of thought.

So what does this mean concretely? Will maps of cultural space ever be as valuable economically as personal assistants who can answer your email and sound like Scarlett Johansson?

Probably not. Using language models to explore cultural space may not be a great short-term investment opportunity. It’s like — what would be a good analogy? A little piece of amber that, weirdly, attracts silk after rubbing. Or, you know, a little wheel containing water that spins when you heat it and steam comes out the spouts. In short, it is a curiosity that some of us will find intriguing because we don’t yet understand how it works. But if you’re like me, that’s the best upside imaginable for a new technology.

Wait. You’ve suggested that “mapping and editing culture” is a potential upside of AI, allowing us to explore a new space of human potential. But couldn’t this power be misused? What if the builders of Grok don’t “expose their own dishonesty,” but successfully manipulate the broader culture?

Yep, that could happen. I stressed earlier that this post was not going to try to weigh costs against benefits, because I don’t know—and I don’t think anyone knows—how this will play out. My goal here was “purely to sharpen our sense of the potential upside of cultural technology,” and help “specify what we would need to do to make it exciting.” I’m trying to explain a particular challenge and show how the balance could swing back and forth on it. A guarantee that things will, in fact, play out for the best is not something I would pretend to offer.

A future where human beings have the ability to map and edit culture could be very dark. But I don’t think it will be boring or passively depressing. If we think this sounds boring, we’re not thinking hard enough.

References

Altman, S. (2025, June 10). The Gentle Singularity. Retrieved July 2, 2025, from https://blog.samaltman.com/the-gentle-singularity

Anthropic Interpretability Team. (2024, April). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. Retrieved July 2, 2025, from https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922

Farrell, H., Gopnik, A., Shalizi, C., & Evans, J. (2025, March 14). Large AI models are cultural and social technologies. Science, 387(6739), 1153–1156. https://doi.org/10.1126/science.adt9819

Manning, C. D. (2022). Human language understanding & reasoning. Daedalus, 151(2), 127–138. https://doi.org/10.1162/daed_a_01905

Mollick, E. (2024). Co-Intelligence: Living and working with AI. Penguin Random House.

Narayanan, A., & Kapoor, S. (2025, April 15). AI as normal technology: An alternative to the vision of AI as a potential superintelligence. Knight First Amendment Institute.

Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., Ye, A., Jiang, L., Lu, X., Dziri, N., Althoff, T., & Choi, Y. (2024). A roadmap to pluralistic alignment. arXiv. https://arxiv.org/abs/2402.05070

Standard Ebooks. (2022). Phaedrus (B. Jowett, Trans.). Retrieved July 2, 2025, from https://standardebooks.org/ebooks/plato/dialogues/benjamin-jowett/text/single-page

Varnum, M. E. W., Baumard, N., Atari, M., & Gray, K. (2024). Large language models based on historical text could offer informative tools for behavioral science. Proceedings of the National Academy of Sciences of the United States of America, 121(42), e2407639121. https://doi.org/10.1073/pnas.2407639121

Wilder, B. (2025). Equilibrium effects of LLM reviewing. Retrieved July 2, 2025, from https://bryanwilder.github.io/files/llmreviews.html


Should artificial intelligence be person-shaped?

Since Mary Shelley, writers of science fiction have enjoyed musing about the moral dilemmas created by artificial persons.

I haven’t, to be honest. I used to insist on the term “machine learning,” because I wanted to focus on what the technology actually does: model data. Questions about personhood and “alignment” felt like anthropocentric distractions, verging on woo.

But these days woo is hard to avoid. OpenAI is now explicitly marketing the promise that AI will cross the uncanny valley and start to sound like a person. The whole point of the GPT-4o demo was to show off the human-sounding (and um, gendered) expressiveness of a new model’s voice. If there had been any doubt about the goal, Sam Altman’s one-word tweet “her” removed it.

Mira Murati, Mark Chen, and Barret Zoph at the GPT-4o demo.

At the other end of the spectrum lies Apple, which seems to be working hard to avoid any suggestion that the artificial intelligence in their products could coalesce into an entity. The phrase “Apple Intelligence” has a lot of advantages, but one of them is that it doesn’t take a determiner. It’s Apple Intelligence, not “an apple intelligence.” Apple’s conception of this feature is more like an operating system — diffuse and unobtrusive — just a transparent interface to the apps, schedules, and human conversations contained on your phone.

Craig Federighi at WWDC ’24. If you look closely, Apple Intelligence includes “a more personal Siri.” But if you look even closer, the point is not that Siri has more personhood but that it will better understand yours (e.g., when your mother’s flight arrives).

If OpenAI is obsessed with Her, Apple Intelligence looks more like a Caddy from All the Birds in the Sky. In Charlie Jane Anders’ novel, Caddies are mobile devices that quietly guide their users with reminders and suggestions (restaurants you might like, friends who happen to be nearby, and so on). A Caddy doesn’t need an expressive voice, because it’s a service rather than a separate person. In All the Birds, Patricia starts to feel it’s “an extension of her personality” (173).

There are a lot of reasons to prefer Apple’s approach. Putting the customer at the center of a sales pitch is usually a smart move. Ben Evans also argues that users will understand the limitations of AI better if it’s integrated into interfaces that provide a specific service rather than presented as an open-ended chatbot.

Moreover, Apple’s approach avoids several kinds of cringe invited by OpenAI’s demo — from the creepily-gendered Pygmalion vibe to the more general problem that we don’t know how to react to the laughter of a creature that doesn’t feel emotion. (Readers of Neuromancer may remember how much Case hates “the laugh that wasn’t laughter” emitted by the recording of his former teacher, the Dixie Flatline.)

Finally, impersonal AI is calculated to please grumpy abstract thinkers like me, who find a fixation on so-called “general” intelligence annoyingly anthropocentric.

However. Let’s look at the flip side for a second.

The most interesting case I’ve heard for person-shaped AI was offered last week by Amanda Askell, a philosopher working at Anthropic. In an interview with Stuart Ritchie, Askell argues that AI needs a personality for two reasons. First, shaping a personality is how we endow models with flexible principles that will “determine how [they] react to new and difficult situations.” Personality, in other words, is simply how we reason about character. Second, personality signals to users that they’re not talking to an omniscient oracle.

“We want people to know that they’re interacting with a language model and not a person. But we also want them to know they’re interacting with an imperfect entity with its own biases and with a disposition towards some opinions more than others. Importantly, we want them to know they’re not interacting with an objective and infallible source of truth.”

It’s a good argument. One has to approach it skeptically, because there are several other profitable reasons for companies to give their products a primate-shaped UI. It provides “a more interesting user experience,” as Askell admits — and possibly a site of parasocial attachment. (OpenAI’s Sky voice sounded a bit like Scarlett Johansson.) Plus, human behavior is just something we know how to interpret. I often prefer to interact with ChatGPT in voice mode, not only because it leaves my hands and eyes free, but because it gives the model an extra set of ways to direct my attention — ranging from emphasis to, uh, theatrical pauses that signal a new or difficult topic.

But this ends up sending us back to Askell’s argument. Even if models are not people, maybe we need the mask of personality to understand them? A human-sounding interface provides both simple auditory signals and epistemic signals of bias and limitation. Suppressing those signals is not necessarily more honest. It may be relevant here that the impersonal transparency of the Caddies in All the Birds in the Sky turns out to be a lie. No spoilers, but the Caddies actually have an agenda, and are using those neutral notifications and reminders to steer their human owners. It wouldn’t be shocking if corporate interfaces did the same thing.

So, should we anthropomorphize AI? I think it’s a much harder question than is commonly assumed, and maybe not a question that can be answered at all. Apple and Anthropic are selling different products, to different audiences. There’s no reason one of them has to be wrong.

On Bluesky, Dave Palfrey reminds me that the etymology of “person” leads back through “fictional character” to “mask.”

More fundamentally, this is a hard question because it’s not clear that we’re telling the full truth when we anthropomorphize people. Writers and critics have been arguing for a long time that the personality of the author is a mask. As Stéphane Mallarmé puts it, “the pure work implies the disappearance of the poet speaking, who yields the initiative to words” (208). There’s a sense in which all of us are language models. “How do I know what I think until I see what I say?”

This shoggoth could also be captioned “language,” and the mask could be captioned “personality.” Authorship of the image not 100% clear; see the full history of this meme.

So if we feel creeped out by all the interfaces for artificial intelligence — both those that pretend to be neutrally helpful and those that pretend to laugh at our jokes — the reason may be that this dilemma reminds us of something slightly cringe and theatrical about personality itself. Our selves are not bedrock atomic realities; they’re shaped by collective culture, and the autonomy we like to project is mostly a fiction. But it’s also a necessary fiction. Projecting and perceiving personality is how we reason about questions of character and perspective, and we may end up trusting models more if they can play the same game. Even if we flinch a little every time they laugh.

References

Anders, Charlie Jane. All the Birds in the Sky. Tor, 2016.

Askell, Amanda, and Stuart Ritchie. “What Should an AI’s Personality Be?” Anthropic blog. June 8, 2024.

Gibson, William. Neuromancer. Ace, 1984.

Mallarmé, Stéphane. “The Crisis of Verse.” In Divagations, trans Barbara Johnson. Harvard University Press, 2007.

Jonze, Spike, writer and director. Her. Annapurna Pictures / Warner Bros., 2013.


Can language models predict the next twist in a story?

While distant reading has taught us a lot about the history of fiction, it hasn’t done much yet to explain why we keep turning pages.

“Suspense” is the word we use to explain that impulse. But what is suspense? Does it require actual anxiety, or just uncertainty about what happens next? If suspense depends on not knowing what will happen, how can we enjoy re-reading familiar books? (Smuts 2009) And why do we enjoy being surprised? (Tobin 2018)

Beyond these big theoretical puzzles, there are historical questions scholars might like to ask about the way authors use chapter breaks to structure narrative revelation (Dames 2023, 219-38).

Right now, distant reading can’t fully answer any of these questions. When we want to measure surprise or novelty, for instance, we typically measure change in the verbal texture of a story from beginning to end. I made a coarse attempt of that kind in a blog post a few years ago. Other articles use better methods, and give us new ways to think about form (McGrath et al. 2018, Piper et al. 2023). But how closely does the pace of verbal change correlate with readers’ experience of uncertainty or surprise? We don’t know.

Autoregressive language models offer a tempting new angle on this problem, because they’re trained specifically to predict how a given text will continue. Intuitively, it feels like we could measure the predictability of a plot by first asking a model to continue the story, and then measuring the divergence between predicted continuation and real text. Even if this isn’t exactly how readers form expectations and experience surprise, it might begin to give us some leverage on the question.

Researchers have run a loosely similar experiment on very short stories contributed by experimental subjects (Sap et al. 2020). But scaling that up to published novels poses a challenge. For one thing, language models may not be equally good at imitating every style. A contemporary model’s failure to predict the next sentence by Jane Austen might just mean that it’s bad at channeling the Regency.

So, to factor style out of the question, let’s ask a model to predict what will happen in, say, the next three pages of a story — and then compare those predictions to its own summaries of the pages when it sees them.

Readers of a certain age will recognize this as a game Ernie invites Bert to play on “Sesame Street.”

Ernie asks Bert “what happens next” in this picture. Bert anticipates that the man will step in the pail, and disaster will ensue.

To spell the method out more precisely: we move through a novel roughly 900 words at a time. On each pass, we give a language model both a recap of earlier events, and a new 900-word passage. We ask the model to summarize the new passage, and also ask it to predict what will happen next. Then we compare its prediction to the summary it generates when it actually sees the next passage, and measure cosine distance between the two sentence embeddings. A large distance means the model did a poor job of predicting “what would happen next.”
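For concreteness, here is a minimal sketch of that loop in Python. The prompt wording, the chunking by word count, and the use of sentence-transformers for embeddings are my assumptions, not necessarily the choices made in the actual code for this post (which is linked at the end).

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is my choice

def ask(prompt: str) -> str:
    """Send one prompt to the chat model and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def chunks(text: str, size: int = 900) -> list[str]:
    """Split a novel into roughly 900-word passages."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def surprise_curve(novel: str) -> list[float]:
    """For each passage, compare the previous prediction to the new summary."""
    recap, prediction, distances = "", None, []
    for passage in chunks(novel):
        summary = ask(f"Recap of the story so far: {recap}\n\n"
                      f"Summarize the following passage in one sentence:\n{passage}")
        if prediction is not None:
            # large cosine distance = the model failed to see this passage coming
            distances.append(cosine(embedder.encode(prediction),
                                    embedder.encode(summary)))
        prediction = ask(f"Recap of the story so far: {recap} {summary}\n\n"
                         "In one sentence, predict what will happen in the "
                         "next three pages.")
        recap = f"{recap} {summary}".strip()
    return distances
```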

Does this have any relation to human uncertainty?

I’m not claiming that this is a good model of the way readers experience plot. We don’t have a good model of that yet! The more appropriate question to ask is: Does this correlate at all with anything human readers do?

We can check by asking a reader to do the same thing: read roughly 900-word passages and make predictions about the future. Then we can compare the human reader’s predictions to automatically-generated summaries.

Passages were drawn from Now in November, by Josephine Johnson, and Murder is Dangerous, by Saul Levinson. n = 51 passages, Pearson’s r = .41, p < .01. Human predictions are more variable in quality than the model’s.

When I did this for two novels that were complete blanks to me, my predictions tended to diverge from the actual course of the story in roughly the same places where the model found prediction difficult. So there does seem to be some relationship between a language model’s (in)ability to see what’s coming and a human reader’s.

The image above also reveals that there are broad, consistent differences between books. For both people and models, some stories are easier to predict than others.

A reason not to trust this

Readers of this story may already anticipate the next twist—which is that of course we shouldn’t use LLMs to study uncertainty, because these models have already read many of the books we’re studying and will (presumably) already know the plot.

This is a particularly nasty problem because we don’t have a list of the books commercial models were trained on. We’re flying blind. But before we give up, let’s test how much of a problem this really poses. Researchers at Berkeley have defined a convenient test of the extent to which a model has memorized a book (Chang et al. 2023). In essence, they ask the model to fill in missing names.
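A toy version of that probe, reusing the ask helper from the sketch above; the prompt is my paraphrase, not Chang et al.’s exact template.

```python
def name_cloze_accuracy(examples: list[tuple[str, str]]) -> float:
    """examples: (passage with one character name replaced by [MASK], true name)."""
    hits = 0
    for masked_passage, true_name in examples:
        guess = ask("Fill in the [MASK] in this novel excerpt with the single "
                    f"proper name that belongs there:\n{masked_passage}")
        # count a hit if the true name appears anywhere in the model's answer
        hits += true_name.lower() in guess.lower()
    return hits / len(examples)
```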

Running this test, Chang et al. find that GPT-4 remembers many books in detail. Moreover, its ability to fill in masked names correlates with its accuracy on certain other tasks—like its ability to estimate date of publication. This could be a problem for questions about plot.

To avoid this problem (and also save money), I’ve been using GPT-3.5, which Chang et al. find is less prone to memorize books. But is that enough to address the problem? Let’s check. Below I’ve plotted the average divergence between prediction and summary for 25 novels on the y axis, and GPT-3.5’s ability to supply masked names in those texts on the x axis. If memorization was making prediction more accurate, we would expect to see a negative correlation: predictions’ divergence from summaries should go down as name_cloze accuracy goes up.

The y axis is average cosine distance between prediction and summary; x axis is GPT-3.5’s accuracy on the name cloze test defined in Chang et al.

25 books is not enough for a conclusive answer, but so far I cannot measure any pattern of that kind. (If anything, there is a faint trend in the opposite direction.)

In an ideal world, researchers would use language models trained on open data sets that they know and control. But until we get to an ideal world, it looks like it may be possible to run proof-of-concept experiments with things like GPT-3.5, at least if we avoid extremely famous books.

Scrutinizing the image above, readers will probably notice that the most predictable book in this sample was Zoya, by Danielle Steel. Although Steel has a reputation that may encourage disparaging inferences — see Dan Sinykin, Big Fiction, for why — I don’t think we’re in a position to draw those inferences yet. The local rhythms that make prediction possible across three pages are not necessarily what critics mean when they use “predictable” to diss a book.

So what could we learn from predicting the next three pages?

To consider one possible payoff: it might give us a handle on the way chapter-breaks, and other divides, structure the epistemic rhythms of fiction. For instance, many readers have noticed that the installments of novels originally published in magazines tend to end with an explicit mystery to ensure that you keep reading (Haugtvedt 2016 and Beekman 2017). In the first installment of Arthur Conan Doyle’s Hound of the Baskervilles (which covers two chapters), Watson and Holmes learn about a legendary curse that connects the family of the Baskervilles to a fiendish hound. In the final lines of the first installment, Holmes asks the family doctor about footprints found near the body of Sir Charles Baskerville. “A man’s or a woman’s?” Holmes asks. The doctor’s “voice sank almost to a whisper as he answered. ‘Mr Holmes, they were the footprints of a gigantic hound!'” End installment.

Novels don’t have soundtracks. But unexplained, suggestive new information is as good as a sting: “Bum – bum – BUM!”

The Hound of the Baskervilles, by Sidney Paget, 1902.

It appears that we can measure this cliffhanger effect: the serial installments of The Hound of the Baskervilles often end with a moment of heightened mystery — at least, if inability to predict the next three pages is any measure of mystery. When we measure predictive accuracy throughout the story, making four different passes to ensure we have roughly-900-word chunks aligned with all the chapter breaks, we find that predictions are farther from reality at the ends of serial installments. There is no similar effect at other chapter breaks.

The mean for breaks at serial installments is more than one standard deviation above the mean for other chapter-breaks. In spite of tiny n, this is actually p < .05.
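For anyone who wants to replicate that comparison, the arithmetic is simple. The post doesn’t say which significance test was used, so the Welch t-test below is my assumption:

```python
import numpy as np
from scipy.stats import ttest_ind

def installment_effect(serial_ends: list[float], other_breaks: list[float]):
    """Compare prediction error at serial-installment ends vs. other chapter breaks."""
    gap_in_sds = ((np.mean(serial_ends) - np.mean(other_breaks))
                  / np.std(other_breaks, ddof=1))
    t, p = ttest_ind(serial_ends, other_breaks, equal_var=False)  # Welch's t-test
    return gap_in_sds, p
```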

Now, this is admittedly a cherry-picked example. So far, I have only looked at seven novels where we can distinguish the ends of serial installments from other kinds of chapter break (using data from Warhol et al.). And I don’t see this pattern in all of them.

So I’m not yet making any historical argument about serialization and the rise of the cliffhanger. I’m just suggesting that it’s the kind of question someone could eventually address using this method. A doctoral student could do it, for instance, with a locally hosted model. (I don’t recommend doing it with GPT-3.5, because I dropped $150 or so on this post, and that might add up across a dissertation.) Some initial tests suggest to me that this approach will produce results significantly different than we’re getting with lexical methods.

Since I’m explicitly encouraging people to run with this, let me also say that someone actually writing a paper using this method might want to tinker with several things before trusting it:

  1. Measuring the distance between the embedding of one prediction sentence and one summary sentence is a crude way to measure expectation and surprise. Readers don’t necessarily form a single expectation about plot. Maybe it would be better to model expectation as a range of possibility? (See the sketch after this list.)
  2. Related to this: models may need to be nudged to speculate and not just predict that current actions will continue.
  3. 900-word chunks may not be the only appropriate scale of analysis. When readers talk about narrative surprise they’re often thinking about larger arcs like “who will he marry?” or “who turns out to be the murderer?”
  4. We need a way to handle braided narratives where each chapter is devoted to a different group of characters (Garrett 1980). In a multi-plot story, the B or C plot will often not continue across a chapter break.
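On point 1, for instance, one cheap fix is to sample several predictions per passage and score the best of them, so the model gets credit if any of its guesses came true. A sketch, reusing the ask, embedder, and cosine helpers defined earlier:

```python
def min_surprise(prediction_prompt: str, realized_summary: str, k: int = 5) -> float:
    """Sample k predictions; return the distance of the closest one to reality."""
    guesses = [ask(prediction_prompt) for _ in range(k)]  # temperature > 0 varies them
    target = embedder.encode(realized_summary)
    return min(cosine(embedder.encode(g), target) for g in guesses)
```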

But we’re in a multi-plot narrative ourselves, so those problems may be solved by a different group of characters. This was just a blog post to share an idea and get people arguing about it. Tune in next time, for our thrilling conclusion. (Bum – bum – BUM!)

Code and data used for this post are available on GitHub.

The ideas discussed here were previously presented in Paris at a workshop on AI for the analysis of literary corpora, and in Copenhagen in a conference on generative methods in the social sciences and humanities. I’d like to thank the organizers of those events, esp. Thierry Poibeau, Anders Munk, and Rolf Lund, for stimulating conversation — and also many people in attendance, especially David Bamman, Lynn Cherny, and Meredith Martin. In writing code to query the OpenAI API, I borrowed snippets from Quinn Dombrowski (and also of course from GPT-4 itself oh brave new world &c). My thinking about 19c serialization was advanced by conversation with David Bishop and Eleanor Courtemanche, and by suggestions from Ryan Cordell and Elizabeth Foxwell on Bluesky.

References

Beekman, G. (2017) “Emotional Density, Suspense, and the Serialization of The Woman in White in All the Year Round.” Victorian Periodicals Review 50.1.

Chang, K., Cramer, M., Soni, S., & Bamman, D. (2023) “Speak, Memory: An Archaeology of Books Known to ChatGPT / GPT-4,” https://arxiv.org/abs/2305.00118.

Dames, N. (2023) The Chapter: A Segmented History from Antiquity to the Twenty-First Century. Princeton University Press.

Garrett, P. (1980) The Victorian Multiplot Novel: Studies in Dialogical Form. Yale University Press.

Haugtvedt, E. (2016) “The Sympathy of Suspense: Gaskell and Braddon’s Slow and Fast Sensation Fiction in Family Magazines.” Victorian Periodicals Review 49.1.

McGrath, L., Higgins, D., & Hintze, A. (2018) “Measuring Modernist Novelty.” Journal of Cultural Analytics. https://culturalanalytics.org/article/11030-measuring-modernist-novelty

Piper, A., Xu, H., & Kolaczyk, E. D. (2023) “Modeling Narrative Revelation.” Computational Humanities Research 2023. https://ceur-ws.org/Vol-3558/paper6166.pdf

Sap, M., Horvitz, E., Choi, Y., Smith, N. A., & Pennebaker, J. (2020) “Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models.” Proceedings of the 58th Annual Meeting of the ACL. https://aclanthology.org/2020.acl-main.178/

Sinykin, D. (2023) Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. Columbia University Press.

Smuts, A. (2009) “The Paradox of Suspense.” Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/paradox-suspense/#SusSur

Tobin, V. (2018) The Elements of Surprise: Our Mental Limits and the Satisfactions of Plot. Harvard University Press.

Warhol, R., et al. “Reading Like a Victorian.” https://readinglikeavictorian.osu.edu/.