Categories
artificial intelligence social effects of machine learning

A more interesting upside of AI

My friends are generally optimistic, forward-looking people, but talking about AI makes many of them depressed. Either AI is scarier than other technologies, or public conversation about it has failed them in some way. Or both.

I think the problem is not just that people have legitimate concerns. What’s weird and depressing about AI discourse right now is the surprising void where we might expect to find a counterbalancing positive vision.

It is not unusual, after all, for new technologies to have a downside. Airplanes were immediately recognized as weapons of war, and we eventually recognized that the CO2 they produce is not great either. But their upside is also vivid: a new and dizzying freedom of motion through three-dimensional space. Is the upside worth the cost? Idk. Saturation bombing has been bad for civilians. But there is at least something in both pans of the scale. It can swing back and forth. So when we think about flight—even if we believe it has been destructive on balance—we can see tension and a possibility of change. We don’t just feel passively depressed.

Is “super-intelligence” the upside for AI?

What people seem to want to put on the “positive” side of the balance for AI is is a 1930s-era dystopia skewered well by Helen De Cruz.

The upside of AI, apparently, is that super-intelligence replaces human agency. This is supposed to have good consequences: accelerating science and so on. But it’s not exactly motivating, because it’s not clear what we get to do in this picture.

Sam Altman’s latest blog post (“The Gentle Singularity”) reassures us by telling us we will always feel there’s something to do. “I hope we will look at the jobs a thousand years in the future and think they are very fake jobs, and I have no doubt they will feel incredibly important and satisfying to the people doing them.”

If this is the upside on offer, I’m not surprised people are bored and depressed. First, it’s an unappealing story, as Ryan Moulton explains:

Secondly, as Ryan hints in his last sentence, Altman’s vision of the future isn’t very persuasive. Stories about fully automated societies where superintelligent AI makes the strategic decisions, coordinates the supply chains, &c, quietly assume that we can solve “alignment” not only for models but for human beings. Bipedal primates are expected to naturally converge on a system that allows decisions to be made by whatever agency is smartest or produces best results. Some version of this future has sounded rational and plausible to nerds since Plato. But somehow we nerds consistently fail to make it reality—in spite of our presumably impressive intelligence. If you want a vision of the future of “super-intelligence,” consider the fate of the open web or the NSF.

I’m not just giving a fatalistic shrug about politics and markets here. I think cutting the NSF was a bad idea, but there are good reasons why we keep failing to eliminate human disagreement. It’s a load-bearing part of the system. If you or I tried to automate, say, NSF review panels, our automated system would develop blind spots, and would eventually need external criticism. (For a good explanation of why, see Bryan Wilder’s blog post on automated review.) Conceivably that external criticism could be provided by AI. But if so, the AI would need to be built or controlled by someone else—someone independent of us. If a task is really important, you need legal persons who can’t edit or delete each other arguing over it.

AI as a cultural technology

The irreducible centrality of human conflict is one reason why I doubt that “super-intelligence” is the right frame for thinking about economic and social effects of AI. However smart it gets, a system that lacks independent legal personhood is not a good substitute for a human reviewer or manager. Nor do I think it’s likely that fractious human polities will redefine legal personhood so it can be multiplied by hitting command-C followed by command-V.

A more coherent positive picture of a future with AI has started to emerge. As the title of Ethan Mollick’s Co-Intelligence implies, it tends to involve working with AI assistance, not resigning large sectors of the economy to a super-intelligence. I’ve outlined one reason to expect that path above. Arvind Narayanan and Sayash Kapoor have provided a more sustained argument that AI capability is unlikely to exponentially exceed human capability across a wide range of tasks.

One reason they don’t expect that trajectory is that the recent history of AI has not tended to support assumptions about the power of abstract and perfectly generalizable intelligence. Progress was rapid over the last ten years—but not because we first discovered the protean core of intelligence itself, which then made all merely specific skills possible. Instead, models started with vast, diverse corpora of images and texts, and learned how to imitate them. This approach was frustrating enough to some researchers that they dismissed the new models as mere “parrots,” entities that fraudulently feign intelligence by learning a huge repertoire of specific templates.

A somewhat more positive response to this breakthrough, embraced by Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans, has been to characterize AI as a “cultural technology.” Cultural technologies work by transmitting and enacting patterns of behavior. Other examples might include libraries, the printing press, or language itself.

Is this new cultural technology just a false start or a consolation prize in a hypothetical race whose real goal is still the protean core of intelligence? Many AI researchers seem to think so. The term “AGI” is often used. Some researchers, like Yann LeCun, argue that getting to AGI will require a radically different approach. Others suspect that transformer models or diffusion models can do it with sufficient scale.

I don’t know who’s right. But I also don’t care very much. I’m not certain I believe in absolutely general intelligence—and I know that I don’t believe culture is a less valuable substitute for it.

On the contrary. I’m fond of Christopher Manning’s observation that, where sheer intelligence is concerned, human beings are not orders of magnitude different from bonobos. What gave us orders of magnitude greater power to transform this planet, for good or ill, was language. Language vastly magnified our ability to coordinate patterns of collective behavior (culture), and transmit those patterns to our descendants. Writing made cultural patterns even more durable. Now generative language models (and image and sound models) represent another step change in our ability to externalize and manipulate culture.

Why “cultural technology” doesn’t make anyone less depressed

I’ve suggested that a realistic, potentially positive vision of AI has started to coalesce. It involves working with AI as a “normal technology” (Naranayan and Kapoor), one in a long sequence of “cultural technologies” (Gopnik, Farrell, et al) that have extended the collective power of human beings.

So why are my friends still depressed?

Well, if they think the negative consequences of AI will outweigh positive effects, they have every right to be depressed, because no one has proven that’s wrong. It is absolutely still possible that AI will displace people from existing jobs, force retraining, increase concentration of power, and (further) destabilize democracy. I don’t think anyone can prove that the upside of AI outweighs those possible downsides. The cultural and political consequences of technology are hard to predict, and we are not generally able to foresee them at this stage.

French aviator Louis Paulhan flying over Los Angeles in 1910. Library of Congress.

But as I hinted at the beginning of this post, I’m not trying to determine whether AI is good or bad on balance. It can be hard to reach consensus about that, even with a technology as mature as internal combustion or flight. And, even with very mature technologies, it tends to be more useful to try to change the balance of effects than to debate whether it’s currently positive.

So the goal of this post is not to weigh costs and benefits, or argue with skeptics. It is purely to sharpen our sense of the potential upside latent in a vision of AI as “cultural technology.” I think one reason that phrase hasn’t cheered anyone up is that it has been perceived as a deflating move, not an inspiring one. The people disposed to be interested in AI mostly got hooked by a rather different story, about protean general intelligences closely analogous to human beings. If you tell AI enthusiasts “no, this is more like writing,” they tend to get as depressed as the skeptics. My goal here is to convince people who use AI that “this is like writing” is potentially exciting—and to specify some of the things we would need to do to make it exciting.

Mapping and editing culture

So what’s great about writing? It is more durable than the spoken word, of course. But just as importantly, writing allows us to take a step back from language, survey it, fine-tune it, and construct complex structures where one text argues with two others, each of which footnotes fifty others. It would be hard to imagine science without the ability writing provides to survey language from above and use it as building material.

Generative AI represents a second step change in our ability to map and edit culture. Now we can manipulate, not only specific texts and images, but the dispositions, tropes, genres, habits of thought, and patterns of interaction that create them. I don’t think we’ve fully grasped yet what this could mean.

I’ll try to sketch one set of possibilities. But I know in advance that I will fail here, just as someone in 1470 would have failed to envision the scariest and most interesting consequences of printing. At least maybe we’ll get to the point of seeing that it’s not mostly “cheaper Bibles.”

Here’s a Bluesky post that used generative AI to map the space of potential visual styles using a “style reference” (sref).

In the early days of text-to-image models, special phrases were passed around like magic words. Adding “Unreal Engine” or “HD” or “by James Gurney” produced specific stylistic effects. But the universe of possible styles is larger than a dictionary of media, artistic schools, or even artists’ names can begin to cover. If we had a way to map that universe, we could explore blank spaces on the map.

Midjourney invented “style references” as simple way to do that. You create a reference by choosing between pairs of images that represent opposing vectors in a stylistic plane. In the process of making those choices, you construct your own high-dimensional vector. Once you have a code for the vector, you can use it as an newly invented adjective, and dial its effect up or down.

A map with just a touch of the style shared above by Susan Fallow, which adds (among other things) traces that look like gilding. (“an ancient parchment map showing the coastline of an imagined world –sref 321312992 –sv 4 –sw 20 –ar 2:1“)

“Style references” are modest things, of course. But if we can map a space of possibility for image models, and invent new adjectives to describe directions in that space, we should be able to do the same for language models.

And “style” is not the only thing we could map. In exploring language, it seems likely that we will be mapping different ways of thinking. The experiment called “Golden Gate Claude” was an early, crude demonstration of what this might mean. Anthropic mapped the effect of different neurons in a model, and used that knowledge to tune up one neuron and create a version of Claude deeply obsessed with the Golden Gate Bridge. Given any topic, it would eventually bring the conversation around to fog, or the movie Vertigo — which would remind it, in turn, of San Francisco’s iconic Golden Gate Bridge.

Golden Gate Claude was more of a mental illness than a practical tool. (It reminded me of Vertigo in more than one sense.) But making the model obsessed with a specific landmark was a deliberate simplification for dramatic effect. Anthropic’s map of their model had a lot more nuance available, if it had been needed.

From “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”

The image above is map of conceptual space based on neurons in a single model. You could think of it as a map of a single mind, or (if you prefer) a single pattern of language use. But models also differ from each other, and there’s no reason why we need to be limited to considering one at a time.

Academics working in this space (both in computer science and in the social sciences) are increasingly interested in using language models to explore cultural pluralism. As we develop models trained on different genres, languages, or historical periods, these models could start to function as reference points in a larger space of cultural possibility that represents the differences between maps like the one above. It ought to be possible to compare different modes of thought, edit them, and create new adjectives (like style references) to describe directions in cultural space.

If we can map cultural space, could we also discover genuinely new cultural forms and new ways of thinking? I don’t know why not. When writing was first invented it was, obviously, a passive echo of speech. “It is like a picture, which can give no answer to a question, and has only a deceitful likeness of a living creature” (Phaedrus).

But externalizing language, and fixing it in written marks, eventually allowed us to construct new genres (the scientific paper, the novel, the index) that required more sustained attention or more mobility of reference than the spoken word could support. Models of culture should similarly allow us to explore a new space of human possibility by stabilizing points of reference within it.

Wait, is editing culture a good idea?

“A new ability to map and explore different modes of reasoning” may initially sound less useful than a super-intelligence that just produces the right answer to our problems. And, look, I’m not suggesting that mapping culture is the only upside of AI. Drug discovery sounds great too! But if you believe human conflict is our most important and durable problem, then a technology that could improve human self-understanding might eventually outweigh a million new drugs.

I don’t mean that language models will eliminate conflict. I said above that conflict is a load-bearing part of society. And language models are likely to be used as ideological weapons—just as pamphlets were, after printing made them possible. But an ideological weapon that can be quizzed and compared to other ideologies implies a level of putting-cards-on-the-table beyond what we often get in practice now. There is at least a chance, as we have seen with Grok, that people who try to lie with an interactive model will end up exposing their own dishonest habits of thought.

So what does this mean concretely? Will maps of cultural space ever be as valuable economically as personal assistants who can answer your email and sound like Scarlett Johansson?

Probably not. Using language models to explore cultural space may not be a great short-term investment opportunity. It’s like — what would be a good analogy? A little piece of amber that, weirdly, attracts silk after rubbing. Or, you know, a little wheel containing water that spins when you heat it and steam comes out the spouts. In short, it is a curiosity that some of us will find intriguing because we don’t yet understand how it works. But if you’re like me, that’s the best upside imaginable for a new technology.

Wait. You’ve suggested that “mapping and editing culture” is a potential upside of AI, allowing us to explore a new space of human potential. But couldn’t this power be misused? What if the builders of Grok don’t “expose their own dishonesty,” but successfully manipulate the broader culture?

Yep, that could happen. I stressed earlier that this post was not going to try to weigh costs against benefits, because I don’t know—and I don’t think anyone knows—how this will play out. My goal here was “purely to sharpen our sense of the potential upside of cultural technology,” and help “specify what we would need to do to make it exciting.” I’m trying to explain a particular challenge and show how the balance could swing back and forth on it. A guarantee that things will, in fact, play out for the best is not something I would pretend to offer.

A future where human beings have the ability to map and edit culture could be very dark. But I don’t think it will be boring or passively depressing. If we think this sounds boring, we’re not thinking hard enough.

References

Altman, S. (2025, June 10). The Gentle Singularity. Retrieved July 2, 2025, from https://blog.samaltman.com/the-gentle-singularity blog.samaltman.com

Anthropic Interpretability Team. (2024, April). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. Retrieved July 2, 2025, from https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922 researchgate.net

Farrell, H., Gopnik, A., Shalizi, C., & Evans, J. (2025, March 14). Large AI models are cultural and social technologies. Science, 387(6739), 1153–1156. https://doi.org/10.1126/science.adt9819 pubmed.ncbi.nlm.nih.gov

Manning, C. D. (2022). Human language understanding & reasoning. Daedalus, 151(2), 127–138. https://doi.org/10.1162/daed_a_01905 virtual-routes.org

Mollick, E. (2024). Co-Intelligence: Living and working with AI. Penguin Random House. penguinrandomhouse.com

Narayanan, A., & Kapoor, S. (2025, April 15). AI as normal technology: An alternative to the vision of AI as a potential superintelligence. Knight First Amendment Institute. kfai-documents.s3.amazonaws.com

Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., Ye, A., Jiang, L., Lu, X., Dziri, N., Althoff, T., & Choi, Y. (2024). A roadmap to pluralistic alignment. arXiv. https://arxiv.org/abs/2402.05070

Standard Ebooks. (2022). Phaedrus (B. Jowett, Trans.). Retrieved July 2, 2025, from https://standardebooks.org/ebooks/plato/dialogues/benjamin-jowett/text/single-page

Varnum, M. E. W., Baumard, N., Atari, M., & Gray, K. (2024). Large language models based on historical text could offer informative tools for behavioral science. Proceedings of the National Academy of Sciences of the United States of America, 121(42), e2407639121. https://doi.org/10.1073/pnas.2407639121

Wilder, B. (2025). Equilibrium effects of LLM reviewing. Retrieved July 2, 2025, from https://bryanwilder.github.io/files/llmreviews.html bryanwilder.github.io

Categories
artificial intelligence social effects of machine learning

Will AI make us overconfident?

Several times in the last year I’ve been surprised when students who I thought were still learning to program showed up to our meeting with a fully-formed (good) solution to a challenging research problem.

In part, this is just what it feels like to work with students. “Wait, I thought you couldn’t do that, and now you’re telling me you can do it? You somehow learned to do it?”

But there’s also a specific reason why students are surprising me more often lately. In each of these cases, when my eyes widened, the student minimized their own work by adding “oh, it’s easy with ChatGPT.”

I understand what they mean, because I’m having the same experience. As Simon Willison has put it, “AI-enhanced development makes me more ambitious with my projects.” The effect is especially intense if you code, because AI code completion is like wearing seven-league boots. But it’s not just for coding. Natural-language dialogue with a chatbot also helps me understand new concepts and choose between tools.

As in any fairytale, accepting magical assistance comes with risks. Chatbot advice has saved me several days on a project, but if you add up bugs and mistakes it has cost me at least a day too. And it’s hard to find bugs if you’re not the person who put them in! I won’t try to draw up a final balance sheet here, because a) as tools evolve, the balance will keep changing, and b) people have developed very strong priors on the topic of “AI hallucination” and I don’t expect to persuade readers to renounce them.

Instead, let’s agree to disagree about the ratio of “really helping” to “encouraging overconfidence.” What I want to say in this short post is just that, when it comes to education, even encouraging overconfidence can have positive effects. A lot of learning takes place when we find an accessible entry point to an actually daunting project.

I suspect this is how the internet encouraged so many of us to venture across disciplinary boundaries. I don’t have a study to prove it, but I guess I’ve lived it, because I considered switching disciplines twice in my career. The first attempt, in the 1990s, came to nothing because I started by going to the library and checking out an introductory textbook. Yikes: it had like five hundred pages. By 2010 things were different. Instead of starting with a textbook you could go straight to a problem you wanted to solve and Google around until you found a blog solving a related problem and — hmm — it says you have to download this program called “R,” but how hard can that be? just click here …

As in the story of “Stone Soup,” the ease was mostly deceptive. In fact I needed to back up and learn statistics in order to understand the numbers R was going to give me, and that took a few years more than I expected. But it’s still good for everyone that the interactive structure of the internet has created accessible entry points to difficult problems. Even if the problem is in fact still a daunting monolith … now it has a tempting front door.

In many ways, generative AI is just an extension of a project that the internet started thirty years ago: a project to reorganize knowledge interactively. Instead of starting by passively absorbing a textbook, we can can cobble together solutions from pieces we find in blogs and Stack Overflow threads and GitHub repos. Language models are extensions of this project, of course, in the literal sense that they were trained on Stack Overflow threads and GitHub repos! And I know people have a range of opinions about whether that training counts as fair use.

But let’s save the IP debate for another day. All I want to say right now is that the effects of generative AI are also likely to resemble the effects of the shift toward interactivity that began with the Web. In many cases, we’re not really getting a robot who can do a whole project for us. We’re getting an accessible front door, perhaps in the form of a HuggingFace page, plus a chatty guide who will accompany us inside the monolith and help us navigate the stairs. With this assistance, taking on unfamiliar tasks may feel less overwhelming. But the guide is fallible! So it’s still on us to understand our goal and recognize wrong turns.

Is the net effect of all this going to be good or bad? I doubt anyone knows. I certainly don’t. The point of this post is just to encourage us to reframe the problem a little. Our fears and fantasies about AI currently lean very heavily on a narrative frame where we “prompt it” to “do a task for us.” Sometimes we get angry—as with the notorious “Dear Sydney” ad—and insist that people still need to “do that for themselves.”

While that’s fair, I’d suggest flipping the script a little. Once we get past the silly-ad phase of adjusting to language models, the effect of this technology may actually be to encourage people to try doing more things for themselves. The risk is not necessarily that AI will make people passive; in fact, it could make some of us more ambitious. Encouraging ambition has upsides and downsides. But anyway, “automation” might not be exactly the right word for the process. I would lean instead toward stories about seven-league boots and animal sidekicks who provide unreliable advice.*


(* PS: For this reason, I think the narrator/sidekick in Robin Sloan’s Moonbound may provide a better parable about AI than most cyberpunk plots that hinge on sublime/terrifying/godlike singularities.)

Categories
artificial intelligence machine learning Uncategorized

Should artificial intelligence be person-shaped?

Since Mary Shelley, writers of science fiction have enjoyed musing about the moral dilemmas created by artificial persons.

I haven’t, to be honest. I used to insist on the term “machine learning,” because I wanted to focus on what the technology actually does: model data. Questions about personhood and “alignment” felt like anthropocentric distractions, verging on woo.

But these days woo is hard to avoid. OpenAI is now explicitly marketing the promise that AI will cross the uncanny valley and start to sound like a person. The whole point of the GPT-4o demo was to show off the human-sounding (and um, gendered) expressiveness of a new model’s voice. If there had been any doubt about the goal, Sam Altman’s one-word tweet “her” removed it.

Mira Murati, Mark Chen, and Barret Zoph at the GPT-4o demo.

At the other end of the spectrum lies Apple, which seems to be working hard to avoid any suggestion that the artificial intelligence in their products could coalesce into an entity. The phrase “Apple Intelligence” has a lot of advantages, but one of them is that it doesn’t take a determiner. It’s Apple Intelligence, not “an apple intelligence.” Apple’s conception of this feature is more like an operating system — diffuse and unobtrusive — just a transparent interface to the apps, schedules, and human conversations contained on your phone.

Craig Federighi at WWDC ’24. If you look closely, Apple Intelligence includes “a more personal Siri.” But if you look even closer, the point is not that Siri has more personhood but that it will better understand yours (e.g., when your mother’s flight arrives).

If OpenAI is obsessed with Her, Apple Intelligence looks more like a Caddy from All the Birds in the Sky. In Charlie Jane Anders’ novel, Caddies are mobile devices that quietly guide their users with reminders and suggestions (restaurants you might like, friends who happen to be nearby, and so on). A Caddy doesn’t need an expressive voice, because it’s a service rather than a separate person. In All the Birds, Patricia starts to feel it’s “an extension of her personality” (173).

There are a lot of reasons to prefer Apple’s approach. Putting the customer at the center of a sales pitch is usually a smart move. Ben Evans also argues that users will understand the limitations of AI better if it’s integrated into interfaces that provide a specific service rather than presented as an open-ended chatbot.

Moreover, Apple’s approach avoids several kinds of cringe invited by OpenAI’s demo — from the creepily-gendered Pygmalion vibe to the more general problem that we don’t know how to react to the laughter of a creature that doesn’t feel emotion. (Readers of Neuromancer may remember how much Case hates “the laugh that wasn’t laughter” emitted by the recording of his former teacher, the Dixie Flatline.)

Finally, impersonal AI is calculated to please grumpy abstract thinkers like me, who find a fixation on so-called “general” intelligence annoyingly anthropocentric.

However. Let’s look at the flip side for a second.

The most interesting case I’ve heard for person-shaped AI was offered last week by Amanda Askell, a philosopher working at Anthropic. In an interview with Stuart Richie, Askell argues that AI needs a personality for two reasons. First, shaping a personality is how we endow models with flexible principles that will “determine how [they] react to new and difficult situations.” Personality, in other words, is simply how we reason about character. Second, personality signals to users that they’re not talking to an omniscient oracle.

“We want people to know that they’re interacting with a language model and not a person. But we also want them to know they’re interacting with an imperfect entity with its own biases and with a disposition towards some opinions more than others. Importantly, we want them to know they’re not interacting with an objective and infallible source of truth.”

It’s a good argument. One has to approach it skeptically, because there are several other profitable reasons for companies to give their products a primate-shaped UI. It provides “a more interesting user experience,” as Askell admits — and possibly a site of parasocial attachment. (OpenAI’s Sky voice sounded a bit like Scarlett Johansson.) Plus, human behavior is just something we know how to interpret. I often prefer to interact with ChatGPT in voice mode, not only because it leaves my hands and eyes free, but because it gives the model an extra set of ways to direct my attention — ranging from emphasis to, uh, theatrical pauses that signal a new or difficult topic.

But this ends up sending us back to Askell’s argument. Even if models are not people, maybe we need the mask of personality to understand them? A human-sounding interface provides both simple auditory signals and epistemic signals of bias and limitation. Suppressing those signals is not necessarily more honest. It may be relevant here that the impersonal transparency of the Caddies in All the Birds in the Sky turns out to be a lie. No spoilers, but the Caddies actually have an agenda, and are using those neutral notifications and reminders to steer their human owners. It wouldn’t be shocking if corporate interfaces did the same thing.

So, should we anthropomorphize AI? I think it’s a much harder question than is commonly assumed, and maybe not a question that can be answered at all. Apple and Anthropic are selling different products, to different audiences. There’s no reason one of them has to be wrong.

On Bluesky, Dave Palfrey reminds me that the etymology of “person” leads back through “fictional character” to “mask.”

More fundamentally, this is a hard question because it’s not clear that we’re telling the full truth when we anthropomorphize people. Writers and critics have been arguing for a long time that the personality of the author is a mask. As Stéphane Mallarmé puts it, “the pure work implies the disappearance of the poet speaking, who yields the initiative to words” (208). There’s a sense in which all of us are language models. “How do I know what I think until I see what I say?”

This shoggoth could also be captioned “language,” and the mask could be captioned “personality.” Authorship of the image not 100% clear; see the full history of this meme.

So if we feel creeped out by all the interfaces for artificial intelligence — both those that pretend to be neutrally helpful and those that pretend to laugh at our jokes — the reason may be that this dilemma reminds us of something slightly cringe and theatrical about personality itself. Our selves are not bedrock atomic realities; they’re shaped by collective culture, and the autonomy we like to project is mostly a fiction. But it’s also a necessary fiction. Projecting and perceiving personality is how we reason about questions of character and perspective, and we may end up trusting models more if they can play the same game. Even if we flinch a little every time they laugh.

References

Anders, Charlie Jane. All the Birds in the Sky. Tor, 2016.

Askell, Amanda and Richie, Stuart. “What should an AI’s personality be?” Anthropic blog. June 8, 2024.

Gibson, William. Neuromancer. Ace, 1984.

Mallarmé, Stéphane. “The Crisis of Verse.” In Divagations, trans Barbara Johnson. Harvard University Press, 2007.

Warner Bros. Picture presents an Annapurna Pictures production; produced by Megan Ellison, Spike Jonze, Vincent Landay; written and directed by Spike Jonze. Her. Burbank, CA: Distributed by Warner Home Video, 2014.

Categories
19c 20c cultural analytics deep learning fiction plot plot

Can language models predict the next twist in a story?

While distant reading has taught us a lot about the history of fiction, it hasn’t done much yet to explain why we keep turning pages.

“Suspense” is the word we use to explain that impulse. But what is suspense? Does it require actual anxiety, or just uncertainty about what happens next? If suspense depends on not knowing what will happen, how can we enjoy re-reading familiar books? (Smuts 2009) And why do we enjoy being surprised? (Tobin 2018)

Beyond these big theoretical puzzles, there are historical questions scholars might like to ask about the way authors use chapter breaks to structure narrative revelation (Dames 2023, 219-38).

Right now, distant reading can’t fully answer any of these questions. When we want to measure surprise or novelty, for instance, we typically measure change in the verbal texture of a story from beginning to end. I made a coarse attempt of that kind in a blog post a few years ago. Other articles use better methods, and give us new ways to think about form (McGrath et al. 2018, Piper et al. 2023). But how closely does the pace of verbal change correlate with readers’ experience of uncertainty or surprise? We don’t know.

Autoregressive language models offer a tempting new angle on this problem, because they’re trained specifically to predict how a given text will continue. Intuitively, it feels like we could measure the predictability of a plot by first asking a model to continue the story, and then measuring the divergence between predicted continuation and real text. Even if this isn’t exactly how readers form expectations and experience surprise, it might begin to give us some leverage on the question.

Researchers have run a loosely similar experiment on very short stories contributed by experimental subjects (Sap et. al 2020). But scaling that up to published novels poses a challenge. For one thing, language models may not be equally good at imitating every style. A contemporary model’s failure to predict the next sentence by Jane Austen might just mean that it’s bad at channeling the Regency.

So, to factor style out of the question, let’s ask a model to predict what will happen in, say, the next three pages of a story — and then compare those predictions to its own summaries of the pages when it sees them.

Readers of a certain age will recognize this as a game Ernie invites Bert to play on “Sesame Street.”

Ernie asks Bert “what happens next” in this picture. Bert anticipates that the man will step in the pail, and disaster will ensue.

To spell the method out more precisely: we move through a novel roughly 900 words at a time. On each pass, we give a language model both a recap of earlier events, and a new 900-word passage. We ask the model to summarize the new passage, and also ask it to predict what will happen next. Then we compare its prediction to the summary it generates when it actually sees the next passage, and measure cosine distance between the two sentence embeddings. A large distance means the model did a poor job of predicting “what would happen next.”

Does this have any relation to human uncertainty?

I’m not claiming that this is a good model of the way readers experience plot. We don’t have a good model of that yet! The more appropriate question to ask is: Does this correlate at all with anything human readers do?

We can check by asking a reader to do the same thing: read roughly 900-word passages and make predictions about the future. Then we can compare the human reader’s predictions to automatically-generated summaries.

Passages were drawn from Now in November, by Josephine Johnson, and Murder is Dangerous, by Saul Levinson. n = 51 passages, Pearson’s r = .41, p < .01. Human predictions are more variable in quality than the model’s.

When I did this for two novels that were complete blanks to me, my predictions tended to diverge from the actual course of the story in roughly the same places where the model found prediction difficult. So there does seem to be some relationship between a language model’s (in)ability to see what’s coming and a human reader’s.

The image above also reveals that there are broad, consistent differences between books. For both people and models, some stories are easier to predict than others.

A reason not to trust this

Readers of this story may already anticipate the next twist—which is that of course we shouldn’t use LLMs to study uncertainty, because these models have already read many of the books we’re studying and will (presumably) already know the plot.

This is a particularly nasty problem because we don’t have a list of the books commercial models were trained on. We’re flying blind. But before we give up, let’s test how much of a problem this really poses. Researchers at Berkeley have defined a convenient test of the extent to which a model has memorized a book (Chang et al. 2023). In essence, they ask the model to fill in missing names.

Running this test, Chang et al. find that GPT-4 remembers many books in detail. Moreover, its ability to fill in masked names correlates with its accuracy on certain other tasks—like its ability to estimate date of publication. This could be a problem for questions about plot.

To avoid this problem (and also save money), I’ve been using GPT-3.5, which Chang et al. find is less prone to memorize books. But is that enough to address the problem? Let’s check. Below I’ve plotted the average divergence between prediction and summary for 25 novels on the y axis, and GPT-3.5’s ability to supply masked names in those texts on the x axis. If memorization was making prediction more accurate, we would expect to see a negative correlation: predictions’ divergence from summaries should go down as name_cloze accuracy goes up.

The y axis is average cosine distance between prediction and summary; x axis is GPT-3.5’s accuracy on the name cloze test defined in Chang et al.

25 books is not enough for a conclusive answer, but so far I cannot measure any pattern of that kind. (If anything, there is a faint trend in the opposite direction.)

In an ideal world, researchers would use language models trained on open data sets that they know and control. But until we get to an ideal world, it looks like it may be possible to run proof-of-concept experiments with things like GPT-3.5, at least if we avoid extremely famous books.

Scrutinizing the image above, readers will probably notice that the most predictable book in this sample was Zoya, by Danielle Steel. Although Steel has a reputation that may encourage disparaging inferences — see Dan Sinykin, Big Fiction, for why — I don’t think we’re in a position to draw those inferences yet. The local rhythms that make prediction possible across three pages are not necessarily what critics mean when they use “predictable” to diss a book.

So what could we learn from predicting the next three pages?

To consider one possible payoff: it might give us a handle on the way chapter-breaks, and other divides, structure the epistemic rhythms of fiction. For instance, many readers have noticed that the installments of novels originally published in magazines tend to end with an explicit mystery to ensure that you keep reading (Haugtvedt 2016 and Beekman 2017). In the first installment of Arthur Conan Doyle’s Hound of the Baskervilles (which covers two chapters), Watson and Holmes learn about a legendary curse that connects the family of the Baskervilles to a fiendish hound. In the final lines of the first installment, Holmes asks the family doctor about footprints found near the body of Sir Charles Baskerville. “A man’s or a woman’s?” Holmes asks. The doctor’s “voice sank almost to a whisper as he answered. ‘Mr Holmes, they were the footprints of a gigantic hound!'” End installment.

Novels don’t have soundtracks. But unexplained, suggestive new information is as good as a sting: “Bum – bum – BUM!”

The Hound of the Baskervilles, by Sidney Paget, 1902.

It appears that we can measure this cliffhanger effect: the serial installments of The Hound of the Baskervilles often end with a moment of heightened mystery — at least, if inability to predict the next three pages is any measure of mystery. When we measure predictive accuracy throughout the story, making four different passes to ensure we have roughly-900-word chunks aligned with all the chapter breaks, we find that predictions are farther from reality at the ends of serial installments. There is no similar effect at other chapter breaks.

The mean for breaks at serial installments is more than one standard deviation above the mean for other chapter-breaks. In spite of tiny n, this is actually p < .05.

Now, this is admittedly a cherry-picked example. So far, I have only looked at seven novels where we can distinguish the ends of serial installments from other kinds of chapter break (using data from Warhol et al). And I don’t see this pattern in all of them.

So I’m not yet making any historical argument about serialization and the rise of the cliffhanger. I’m just suggesting that it’s the kind of question someone could eventually address using this method. A doctoral student could do it, for instance, with a locally hosted model. (I don’t recommend doing it with GPT-3.5, because I dropped $150 or so on this post, and that might add up across a dissertation.) Some initial tests suggest to me that this approach will produce results significantly different than we’re getting with lexical methods.

Since I’m explicitly encouraging people to run with this, let me also say that someone actually writing a paper using this method might want to tinker with several things before trusting it:

  1. Measuring the distance between the embedding of one prediction sentence and one summary sentence is a crude way to measure expectation and surprise. Readers don’t necessarily form a single expectation about plot. Maybe it would be better to model expectation as a range of possibility?
  2. Related to this: models may need to be nudged to speculate and not just predict that current actions will continue.
  3. 900-word chunks may not be the only appropriate scale of analysis. When readers talk about narrative surprise they’re often thinking about larger arcs like “who will he marry?” or “who turns out to be the murderer?”
  4. We need a way to handle braided narratives where each chapter is devoted to a different group of characters (Garrett 1980). In a multi-plot story, the B or C plot will often not continue across a chapter break.

But we’re in a multi-plot narrative ourselves, so those problems may be solved by a different group of characters. This was just a blog post to share an idea and get people arguing about it. Tune in next time, for our thrilling conclusion. (Bum – bum – BUM!)

Code and data used for this post are available on Github.

The ideas discussed here were previously presented in Paris at a workshop on AI for the analysis of literary corpora, and in Copenhagen in a conference on generative methods in the social sciences and humanities. I’d like to thank the organizers of those events, esp. Thierry Poibeau, Anders Munk, and Rolf Lund, for stimulating conversation — and also many people in attendance, especially David Bamman, Lynn Cherny, and Meredith Martin. In writing code to query the OpenAI API, I borrowed snippets from Quinn Dombrowski (and also of course from GPT-4 itself oh brave new world &c). My thinking about 19c serialization was advanced by conversation with David Bishop and Eleanor Courtemanche, and by suggestions from Ryan Cordell and Elizabeth Foxwell on Bluesky.

References

Beekman, G. (2017) “Emotional Density, Suspense, and the Serialization of The Woman in White in All theYear Round.” Victorian Periodicals Review 50.1.

Chang, K., Cramer, M., Soni, S., & Bamman, D. (2023) “Speak, Memory: An Archaeology of Books known to ChatGPT / GPT-4,” https://arxiv.org/abs/2305.00118.

Dames, N. (2023) The Chapter: A Segmented History from Antiquity to the Twenty-First Century. Princeton University Press.

Garrett, P. (1980) The Victorian Multiplot Novel: Studies in Dialogical Form. Yale University Press.

Haugtvedt, E. (2016) “The Sympathy of Suspense: Gaskell and Braddon’s Slow and Fast Sensation Fiction in Family Magazines.” Victorian Periodicals Review 49.1.

McGrath, L., Higgins, D., & Hintze, A. (2018) “Measuring Modernist Novelty.” Journal of Cultural Analytics. https://culturalanalytics.org/article/11030-measuring-modernist-novelty

Piper, A., Xu, H., & Kolaczyk, E. D. (2023) “Modeling Narrative Revelation.” Computational Humanities Research 2023. https://ceur-ws.org/Vol-3558/paper6166.pdf

Sap, M., Horvitz, E., Choi, Y., Smith, N. A., & Pennebaker, J. (2020) “Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models.” Proceedings of the 58th Annual Meeting of the ACL. https://aclanthology.org/2020.acl-main.178/

Sinykin, Dan. Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. Columbia University Press, 2023.

Smuts, A. (2009) “The Paradox of Suspense.” Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/paradox-suspense/#SusSur

Tobin, V. (2018) The Elements of Surprise: Our Mental Limits and the Satisfactions of Plot. Harvard University Press.

Warhol, Robyn, et al., “Reading Like a Victorian,” https://readinglikeavictorian.osu.edu/.

Categories
deep learning social effects of machine learning

Liberally-educated students need to be more than consumers of AI

The initial wave of controversy over large language models in education is dying down. We haven’t reached consensus about what to do yet. (Retreat to in-class exams? make tougher writing assignments? just forbid the use of AI?) But it’s clear to everyone now that the models will require some response.

In a month or so there will be a backlash to this debate, as professors discover that today’s models are still not capable of writing a coherent, footnoted, twelve-page research paper on their own. We may tell ourselves that the threat to education was overhyped, and congratulate ourselves on having addressed it.

That will be a mistake. We haven’t even begun to discuss the challenge AI poses for education.

For professors, yes, the initial disruption will be easy to address: we can find strategies that allow us to continue evaluating student work while teaching the courses we’re accustomed to teach. Problem solved. But the challenge that really matters here is a challenge for students, who will graduate into a world where white-collar work is being redefined. Some things will get easier: we may all have assistants to help us handle email. But by the same token, students will be asked to tackle bigger challenges.

Our vision of those challenges is confined right now by a discourse that treats models as paper-writing machines. But that’s hardly the limit of their capacity. For instance, models can read. So a lawyer in 2033 may be asked to “use a model to do a quick scan of new case law in these thirty jurisdictions and report back tomorrow on the implications for our project.” But then, come to think of it, a report is bulky and static. So you know what, “don’t write a report. What I really need is a model that’s prepared to provide an overview and then answer questions on this topic as they emerge in meetings over the next week.”

A decade from now, in short, we will probably be using AI not just to gather material and analyze it, but to communicate interactively with customers and colleagues. All the forms of critical thinking we currently teach will still have value in that world. It will still be necessary to ask questions about social context, about hidden assumptions, and about the uncertainty surrounding any estimate. But our students won’t be prepared to address those questions unless they also know enough about machine learning to reason about a model’s assumptions and uncertainty. At higher levels of responsibility, this will require more than being a clever prompter and savvy consumer. White-collar professionals are likely to be fine-tuning their own models; they will need to choose a base model, assess training strategies, and decide whether their models are over-confident or over-cautious.

Midjourney: “a college student looking at helpful utility robots in a department store window, HD photography, 80 mm lens –ar 17:9

The core problem here is that we can’t fully outsource thinking itself. I can trust Toyota to build me a good car, although I don’t know how fuel injection works. Maybe I read Consumer Reports and buy what they recommend? But if I’m buying a thinking process, I really need to understand what I see when I look under the hood. Otherwise “critical thinking” loses all meaning.

Academic conversation so far has not seemed to recognize this challenge. We have focused on preserving existing assignments, when we should be talking about the new courses and new assignments students will need to think critically in the 2030s.

Because AI has been framed as a collection of opaque gadgets, I know this advice will frustrate many readers. “How can we be expected to prepare students for the 2030s? It’s impossible to know the technical details of the tools that will be available. Besides, no one understands how models work. They’re black boxes. The best preparation we can provide is a general attitude of caveat emptor.”

This is an understandable response, because things moved too quickly in the past four years. We were still struggling to absorb the basic principles of statistical machine learning when statistical ML was displaced by a new generation of tools that seemed even more mysterious. Journalists more or less gave up on explaining things.

But there are principles that undergird machine learning. Statistical learning is really, at bottom, a theory of learning: it tries to describe mathematically what it means to generalize about examples. The concept of a “bias-variance tradeoff,” for instance, allows us to reason more precisely about the intuitive insight that there is some optimal level of abstraction for our models of the world.

Illustration borrowed from https://upscfever.com/upsc-fever/en/data/deeplearning2/2.html.

Deep learning admittedly makes things more complex than the illustration above implies. (In fact, understanding the nature of the generalization performed by LLMs is still an exciting challenge — see the first few minutes of this recent talk by Ilya Sutskever for an example of reflection on the topic.) But if students are going to be fine-tuning their own models, they will definitely need at least a foundation in concepts like “variance” and “overfitting.” A basic course on statistical learning should be part of the core curriculum.

I might go a little further, and suggest that professors in every department are going to want reflect on principles and applications of machine learning, so we can give students the background they need to keep thinking critically about our domain of expertise in a world where some (not all) aspects of reading and analysis may be automated.

Is this a lot of work? Yes. Are we already overburdened, and should we have more support? Yes and yes. Professors have already spent much of their lives mastering one field of expertise; asking them to pick up the basics of another field on the fly while doing their original jobs is a lot. So adapting to AI will happen slowly and it will be imperfect and we should all cut each other slack.

But to look on the bright side: none of this is boring. It’s not just a technical hassle to fend off. There are fundamental intellectual challenges here, and if we make it through these rapids in one piece, we’re likely to see some new things.

Categories
deep learning reproducibility and replication social effects of machine learning

We can save what matters about writing—at a price

It’s beginning to sink in that generative AI is going to force professors to change their writing assignments this fall. Corey Robin’s recent blog post is a model of candor on the topic. A few months ago, he expected it would be hard for students to answer his assignments using AI. (At least, it would require so much work that students would effectively have to learn everything he wanted to teach.) Then he asked his 15-year-old daughter to red-team his assignments. “[M]y daughter started refining her inputs, putting in more parameters and prompts. The essays got better, more specific, more pointed.”

Perhaps not every 15-year-old would get the same result. But still. Robin is planning to go with in-class exams “until a better option comes along.” It’s a good short-term solution.

In this post, I’d like to reflect on the “better options” we may need over the long term, if we want students to do more thinking than can fit into one exam period.

If you want an immediate pragmatic fix, there is good advice out there already about adjusting writing assignments. Institutions have not been asleep at the wheel. My own university has posted a practical guide, and the Modern Language Association and Conference on College Composition and Communication have (to their credit) quickly drafted a working paper on the topic that avoids panic and makes a number of wise suggestions. A recurring theme in many of these documents is “the value of process-focused instruction” (“Working Paper,” 10).

Why focus on process? A cynical way to think about it is that documenting the writing process makes it harder for students to cheat. There are lots of polished 5-page essays out there to imitate, but fewer templates that trace the evolution of an idea from an initial insight, through second thoughts, to dialectical final draft.

Making it harder to cheat is not a bad idea. But the MLA-CCCC task force doesn’t dwell on this cynical angle. Instead they suggest that we should foreground “process knowledge” and “metacognition” because those things were always the point of writing instruction. This is much the same thesis Corey Robin explores at the end of his post when he compares writing to psychotherapy: “Only on the couch have I been led to externalize myself, to throw my thoughts and feelings onto a screen and to look at them, to see them as something other, coldly and from a distance, the way I do when I write.”

Midjourney: “a hand writing with a quill reflected in a mirror, by MC Escher, in the style of meta-representation –ar 3:2 –weird 50”

Robin’s spin on this insight is elegiac: in losing take-home essays, we might lose an opportunity to teach self-critique. The task force spins it more optimistically, suggesting that we can find ways to preserve metacognition and even ways to use LLMs (large language models) to help students think about the writing process.

I prefer their optimistic spin. But of course, one can imagine an even-more-elegiac riposte to the task force report. “Won’t AI eventually find ways to simulate critical metacognition itself, writing the (fake) process reflection along with the final essay?”

Yes, that could happen. So this is where we reach the slightly edgier spin I feel we need to put on “teach the process” — which is that, over the long run, we can only save what matters about writing if we’re willing to learn something ourselves. It isn’t a good long-term strategy for us to approach these questions with the attitude that we (professors) have a fixed repository of wisdom — and the only thing AI should ever force us to discuss is, how to convey that wisdom effectively to students. If we take that approach, then yes, the game is over as soon as a model learns what we know. It will become possible to “cheat” by simulating learning.

But if the goal of education is actually to learn new things — and we’re learning those things along with our students — then simulating the process is not something to fear. Consider assignments that take the form of an experiment, for instance. Experiments can be faked. But you don’t get very far doing so, because fake experiments don’t replicate. If a simulated experiment does reliably replicate in the real world, we don’t call that “cheating” — but “in-silico research that taught us something new.”

If humanists and social scientists can find cognitive processes analogous to experiment — processes where a well-documented simulation of learning is the same thing as learning — we will be in the enviable position Robin originally thought he occupied: students who can simulate the process of doing an assignment will effectively have completed the assignment.

I don’t think most take-home essays actually occupy that safe position yet, because in reality our assignments often ask students to reinvent a wheel, or rehearse a debate that has already been worked through by some earlier generation. A number of valid (if perhaps conflicting) answers to our question are already on record. The verb “rehearse” may sound dismissive, but I don’t mean this dismissively. It can have real value to walk in the shoes of past generations. Sometimes ontogeny does need to recapitulate phylogeny, and we should keep asking students to do that, occasionally — even if they have to do it with pencil on paper.

But we will also need to devise new kinds of questions for advanced students—questions that are hard to answer even with AI assistance, because no one knows what the answer is yet. One approach is to ask students to gather and interpret fresh evidence by doing ethnography, interviewing people, digging into archival boxes, organizing corpora for text analysis, etc. These are assignments of a more demanding kind than we have typically handed undergrads, but that’s the point. Some things are actually easier now, and colleges may have to stretch students further in order to challenge them.

“Gathering fresh evidence” puts the emphasis on empirical data, and effectively preserves the take-home essay by turning it into an experiment. What about other parts of humanistic education: interpretive reflection, theory, critique, normative debate? I think all of those matter too. I can’t say yet how we’ll preserve them. It’s not the sort of problem one person could solve. But I am willing to venture that the meta-answer is, we’ll preserve these aspects of education by learning from the challenge and adapting these assignments so they can’t be fulfilled merely by rehearsing received ideas. Maybe, for instance, language models can help writers reflect explicitly on the wheels they’re reinventing, and recognize that their normative argument requires another twist before it will genuinely break new ground. If so, that’s not just a patch for writing assignments — but an advance for our whole intellectual project.

I understand that this is an annoying thesis. If you strip away the gentle framing, I’m saying that we professors will have to change the way we think in order to respond to generative AI. That’s a presumptuous thing to say about disciplines that have been around for hundreds of years, pursuing aims that remained relatively constant while new technologies came and went.

However, that annoying thesis is what I believe. Machine learning is not just another technology, and patching pedagogy is not going to be a sufficient response. (As Marc Watkins has recently noted, patching pedagogy with surveillance is a cure worse than the disease.) This time we can only save what matters about our disciplines if we’re willing to learn something in the process. The best I can do to make that claim less irritating is to add that I think we’re up for the challenge. I don’t feel like a voice crying in the wilderness on this. I see a lot of recent signs — from the admirable work of the MLA and CCCC to books like The Ends of Knowledge (eds. Scarborough and Rudy) — that professors are thinking creatively about a wide range of recent challenges, and are capable of responding in ways that are at once critical and self-critical. Learning is our job. We’ve got this.

References

Center for Innovation in Teaching and Learning, UIUC. “Artificial Intelligence Implications in Teaching and Learning.” Champaign, IL, 2023.

MLA-CCCC Joint Task Force on Writing and AI, “MLA-CCCC Joint Task Force on Writing and AI Working Paper: Overview of the Issues, Statement of Principles, and Recommendations,” July 2023.

Robin, Corey. “How ChatGPT Changed My Plans for the Fall,” July 30, 2023.

Rudy, Seth, and Rachel Scarborough King, The Ends of Knowledge: Outcomes and Endpoints across the Arts and Sciences. London: Bloomsbury, 2023.

Watkins, Marc. “Will 2024 look like 1984?” July 31, 2023.

Categories
deep learning fiction

Using GPT-4 to measure the passage of time in fiction

Language models have been compared to parrots, but the bigger danger is that they turn people into parrots. A student who asks for “a paper about Middlemarch,” for instance, will get a pastiche loosely based on many things in the model’s training set. This may not count as plagiarism, but it won’t produce anything new.

But there are ways to use language models actively and creatively. We can select evidence to be analyzed, put it in a prompt, and specify the questions to be asked. Used this way, language models can create new knowledge that didn’t exist when they were trained. There are many ways to do this, and people may eventually get quite creative. But let’s start with a familiar task, so we can evaluate the results and decide whether language models really help. An obvious place to start is “content analysis”—a research method that analyzes hundreds or thousands of documents by posing questions about specific themes.

Below I work through a simple example of content analysis using the OpenAI API to measure the passage of time in fiction (see this GitHub repo for code). To spoil the suspense: I find that for this task, language models add something valuable, not just because they’re more accurate than older ways of doing this at scale but because they explain themselves better.

Why measure time in fiction?

Researchers already have several good ways to automate content analysis. Named entity extraction addresses certain kinds of questions. Topic modeling addresses others.

But there are also tricky questions that remain hard to answer by counting words. In 2017, for instance, I started to wonder how much time passes, on average, across a page of a novel. Literary-critical tradition suggested that there had been a pretty stable balance between “scene” (minute-by-minute description) and “summary” (which may cover weeks or years) until modernists started to leave out the summary and make every page breathlessly immediate [1]. But when I sat down with two graduate students (Sabrina Lee and Jessica Mercado) to manually characterize a thousand passages from fiction, we found instead a long trend. The average length of time represented in 250 words of fiction had been getting steadily shorter since the early eighteenth century. There was a trend toward immediacy, in other words, but modernism didn’t begin the trend [2].


The average length of time described in 250 words of narration. Y axis is logarithmic. The shaded ribbon represents a 95% confidence interval for the dashed curve, which is itself calculated by loess regression. From Underwood, “Why Literary Time is Measured in Minutes,” p. 352.

How well do bag of-words methods estimate time?

We decided to characterize passages manually because references to time in fiction can’t always be taken literally. If a character thinks (or says) “wow, it’s been thirty years but feels like yesterday,” you don’t want to conclude that thirty years have passed on the page. So word-counting seemed risky. Even as human readers we often found it hard to decide how much time was passing. But when a passage was read by two different people, our estimates agreed with each other well enough to conclude that “fictive time” was a meaningful construct, if not a precise one (r = .74 on log-transformed durations).

A year after we published our paper, Greg Yauney showed that word-counting methods can do an acceptable job of estimating time [3]. He trained a bag-of-words model on the passages we had labeled, and applied it to a new set of passages labeled by new readers. The model-predicted durations correlated with human estimates at r = .35. While this is much lower than inter-human agreement, the model was stable enough to precisely measure a trend (across thousands of books) that matched the trend we had sketched using laborious human reading of a hundred books.

From Yauney, Underwood, and Mimno, “Computational Prediction of Elapsed Narrative Time.”

Replicating this on our old data, I get a slightly higher correlation between models and readers (r = .49 on log-transformed durations), perhaps because the data I’m using was produced by readers who compared notes for several weeks to maximize agreement. But since this is still lower than .74, bag-of-words models are definitely less good at estimating time than human readers.

Can we train LLMs to estimate time?

To get a large language model to answer a question, you first need to make sure it understands the question. As Simon Willison helpfully explains, the way to train a chat model through the API is to demonstrate the interaction you want by providing a series of imagined exchanges between a “user” and an “assistant” [4]. These aren’t real replies from the assistant but ego ideals you’re providing to teach it how to behave.

Here’s the instruction I had the “user” give:

Read the following passage of fiction. Then do five things.
 
1: Briefly summarize the passage.
2: Reason step by step to decide how much time is described in the passage. If the passage doesn't include any explicit reference to time, you can guess how much time the events described would have taken. Even description can imply the passage of time by describing the earlier history of people or buildings. But characters' references to the past or future in spoken dialogue should not count as time that passed in the scene. Report the time using units of years, weeks, days, hours, or minutes. Do not say zero or N/A.
3: If you described a range of possible times in step 2 take the midpoint of the range. Then multiply to convert the units into minutes.
4: Report only the number of minutes elapsed, which should match the number in step 3. Do not reply N/A.
5: Given the amount of speculation required in step 2, describe your certainty about the estimate--either high, moderate, or low.
The passage follows: <omitting this to save space>

You’ll notice that I’m using chain-of-thought prompting, building slowly from summary to inference to an estimated number of minutes. You may also notice I tell the model twice “don’t reply N/A.” I found the model was pretty cautious about speculating and needed to be encouraged to make assumptions and guesses. But to be honest, it’s possible that the model was right to be cautious! Duration is hard to estimate in some cases, and maybe we humans should have allowed N/A responses when we did this.

Here’s an example of an ego-ideal reply I provided to teach the “assistant” how to respond. This was responding to a passage from The Big Sleep.

1: A detective wakes up 'the next morning,' looks out a window for an undefined time, drinks (and presumably needs to make) two cups of coffee, then shaves and showers and gets dressed before stepping out his front door and seeing a car.
2: Making coffee, showering, and getting dressed take at least an hour. There's some ambiguity about whether to count the implicit reference to yesterday (since this is 'the next morning') as time elapsed in the passage, but let's say no, since yesterday is not actually described. So, an hour to 90 minutes.
3: 1.25 hours have elapsed. Multiplying by 60 minutes an hour that's 75 minutes.
4: 75 minutes.
5: Low confidence, because of ambiguity about a reference to the previous day.

Using some code generously shared by Quinn Dombrowski, I gave the model four query-reply sequences like this, then asked it to characterize a new passage with the same instructions.

To assess performance on this simple task, I just extracted the minutes reported in step 4 and compared them to our human estimates from 2017. Other forms of content analysis might require the model to edit or mark up the text provided in the prompt. That’s doable, but I thought I would start with something simple.

I added step 5 (allowing the model to describe its own confidence) because in early experiments I found the model’s tendency to editorialize extremely valuable. The answers to step 5 were also fun to read as replies scrolled up the page, because my new sorcerer’s assistant complained volubly about the ambiguity of its task. (“30 minutes. Low confidence, as the passage is more focused on the poetic and symbolic aspects of the scene rather than providing a clear sense of time.”). In some cases it refused the task altogether and threw the question about confidence back in my face. (“N/A. High confidence that no specific amount of time is described.”) This capacity for backtalk is a feature not a bug: I learned a lot from it.

But after I adjusted my prompt to address the ambiguities the assistant correctly pointed out, complete refusal to answer the question was rare.

How well does GPT-4 estimate time?

I had the Turbo model code 483 passages. Its predictions correlated with human estimates at r = .59. I only asked GPT-4 to code 121 passages (because it’s more expensive to run than Turbo), and it achieved r = .68. This is not as good as inter-human agreement (.74), but it’s closer to human readers than to bag of words models (.35 – .49). And of course GPT-4 does the work more quickly than human readers. It took the three of us several months to generate this data, but my LLM experiment was run in a couple of days. Plus, given an API, large language models are easier to use than other forms of machine learning: the main challenge is to describe your question systematically. This is a method that could realistically be used by researchers with relatively little programming experience.

The total cost to my OpenAI account was $22. Of that amount, about $8 was testing the prompts and running on the (cheaper) Turbo API. The final run on GPT-4 cost $14; if I had completely replicated our work from 2017 on GPT-4 it would have cost $140. So a project like this will not bankrupt researchers. But reliance on OpenAI is still a bad idea in the long run. Universities should develop their own models and APIs, both for reasons of institutional autonomy/survival (!!😬!!) and so that researchers can inspect and tune the models.

However, one could exaggerate the problems of interpretability created by a closed model. Watching the model think out loud about each passage actually gave me a better sense of how results were produced than I could extract from my own records of human research in 2017. (See below for an example from GPT-4: human readers don’t usually document their thought process this verbosely.) Also, because human beings can’t erase their own memories, it’s not possible for us to “repeat the same question with slightly different language.” If I try that on myself, I’m going to remember what I thought before and probably think it again. But a model can wake up each moment with eternal sunshine of the spotless mind.

ABRIDGED PASSAGE FROM Jane Eyre: And signs, for aught we know, may be but the sympathies of Nature with man. When I was a little girl, only six years old, I one night heard Bessie Leaven say to Martha Abbot that she had been dreaming about a little child ... Of late I had often recalled this saying and this incident; for during the past week scarcely a night had gone over my couch that had not brought with it a dream of an infant, which I sometimes hushed in my arms, sometimes dandled on my knee, sometimes watched playing with daisies on a lawn, or again, dabbling its hands in running water. It was a wailing child this night, and a laughing one the next: now it nestled close to me, and now it ran from me; but whatever mood the apparition evinced, whatever aspect it wore, it failed not for seven successive nights to meet me the moment I entered the land of slumber.

ASSISTANT REPLIES. 1: The narrator recalls a saying about dreaming of children being a sign of trouble and remembers an incident from her childhood. She then describes her recent dreams of infants over the past week. 
2: The passage mentions the narrator's childhood memory, which happened when she was six years old, and the past week of dreams. The time elapsed in the passage is the week of dreams. 
3: 1 week has elapsed. At seven days a week that's 7 days. 24 hours in a day, so 168 hours. 60 minutes in an hour, so 10080 minutes. 
4: 10080 minutes. 
5: Moderate confidence, as the childhood memory is not counted as time passing in the scene.

What no model can do

While I think large language models will be tremendously useful assistants in content analysis, I don’t think we can dispense with multiple human readers. As David Bamman explains in a good Twitter thread, the core challenges of this work are “a. coming up with a new construct to measure, b. demonstrating that it can be measured, and c. showing that there’s value in doing so.” Intersubjective agreement between human beings is still the only way to know that we have addressed those questions.

That’s why I wouldn’t feel comfortable just making up my own definition of, say, “suspense,” teaching a model to measure it, and then running the model across a thousand books. The problem with that approach is not that LLMs have measurement error. (As we’ve seen in Greg Yauney’s paper, measurement methods with much more error can be productive.) The problem is that it isn’t clear what a single researcher’s construct means in the first place. To be confident that we’re measuring something called “suspense” we need to show that multiple people recognize it as suspense. And in spite of all the rhetorical personification in this blog post (which I trust you have understood rhetorically), a model is not a separate person. So a project of this kind still needs to start by getting several human beings to read systematically and compare notes, before the human codebook is translated into a prompt.

On the other hand, I do think language models will allow us to pose questions that are currently hard to pose at scale. Questions about plot and character, for instance, require reading a whole book while applying delicate decision criteria that may be hard to remember and keep stable across many books. Language models will help here not just because they’re fast, but because they can provide a relatively stable yardstick by which to measure slippery concepts, even at a modest scale of analysis.

This is a preliminary report on a very strange world. For me, the most surprising take-away from this experiment was not that deep learning is more accurate than statistical NLP, but that it may also be in some ways more interpretable. Because a language model has to think out loud, it tends to automatically document its own reasoning. This is useful even when (or especially when) the model doesn’t think you’ve defined the construct it’s supposed to measure clearly enough.

References

[1] Gérard Genette, Narrative Discourse: An Essay in Method, trans. Jane E. Lewin (Ithaca: Cornell Univ. Press, 1980), 97.

[2] Ted Underwood, “Why Literary Time is Measured in Minutes,” New Literary History 85.2 (2018) 351-65.

[3] Greg Yauney, Ted Underwood, and David Mimno, “Computational Prediction of Elapsed Narrative Time” (2019).

[4] Simon Willison, “A Simple Python Wrapper for the ChatGPT API” (2023).

Categories
interpretive theory machine learning social effects of machine learning transformer models

Mapping the latent spaces of culture

[On Tues Oct 26, the Center for Digital Humanities at Princeton will sponsor a roundtable on the implications of “Stochastic Parrots” for the humanities. To prepare for that roundtable, they asked three humanists to write position papers on the topic. Mine follows. I’ll give a 5-min 500-word précis at the event itself; this is the 2000-word version, with pictures. It also has a DOI if you want a stable version to cite.]

The technology at the center of this roundtable doesn’t yet have a consensus name. Some observers point to an architecture, the Transformer.[1] “On the Dangers of Stochastic Parrots” focuses on size and discusses “large language models.”[2] A paper from Stanford emphasizes applications: “foundation models” are those that can adapt “to a wide range of downstream tasks.”[3] Each definition identifies a different feature of recent research as the one that matters. To keep that question open, I’ll refer here to “deep neural models of language,” a looser category.

However we define them, neural models of language are already changing the way we search the web, write code, and even play games. Academics outside computer science urgently need to discuss their role. “On the Dangers of Stochastic Parrots” deserves credit for starting the discussion—especially since publication required tenacity and courage. I am honored to be part of an event exploring its significance for the humanities.

The argument that Bender et al. advance has two parts: first, that large language models pose social risks, and second, that they will turn out to be “misdirected research effort” anyway, since they pretend to perform “natural language understanding” but “do not have access to meaning” (615).

I agree that the trajectory of recent research is dangerous. But to understand the risks language models pose, I think we will need to understand how they produce meaning. The premise that they simply “do not have access to meaning” tends to prevent us from grasping the models’ social role. I hope humanists can help here by offering a wider range of ways to think about the work language does.

It is true that language models don’t yet represent their own purposes or an interlocutor’s state of mind. These are important aspects of language, and for “Stochastic Parrots,” they are the whole story: the article defines meaning as “meaning conveyed between individuals” and “grounded in communicative intent” (616). 

But in historical disciplines, it is far from obvious that all meaning boils down to intentional communication between individuals. Historians often use meaning to describe something more collective, because the meaning of a literary work, for example, is not circumscribed by intent. It is common for debates about the meaning of a text to depend more on connections to books published a century earlier (or later) than on reconstructing the author’s conscious plan.[4]

I understand why researchers in a field named “artificial intelligence” would associate meaning with mental activity and see writing as a dubious proxy for it. But historical disciplines rarely have access to minds, or even living subjects. We work mostly with texts and other traces. For this reason, I’m not troubled by the part of “Stochastic Parrots” that warns about “the human tendency to attribute meaning to text” even when the text “is not grounded in communicative intent” (618, 616). Historians are already in the habit of finding meaning in genres, nursery rhymes, folktale motifs, ruins, political trends, and other patterns that never had a single author with a clear purpose.[5] If we could only find meaning in intentional communication, we wouldn’t find much meaning in the past at all. So not all historical researchers will be scandalized when we hear that a model is merely “stitching together sequences of linguistic forms it has observed in its vast training data” (617). That’s often what we do too, and we could use help.

A willingness to find meaning in collective patterns may be especially necessary for disciplines that study the past. But this flexibility is not limited to scholars. The writers and artists who borrow language models for creative work likewise appreciate that their instructions to the model acquire meaning from a training corpus. The phrase “Unreal Engine,” for instance, encourages CLIP to select pictures with a consistent, cartoonified style. But this has nothing to do with the dictionary definition of “unreal.” It’s just a helpful side-effect of the fact that many pictures are captioned with the name of the game engine that produced them.

In short, I think people who use neural models of language typically use them for a different purpose than “Stochastic Parrots” assumes. The immediate value of these models is often not to mimic individual language understanding, but to represent specific cultural practices (like styles or expository templates) so they can be studied and creatively remixed. This may be disappointing for disciplines that aspire to model general intelligence. But for historians and artists, cultural specificity is not disappointing. Intelligence only starts to interest us after it mixes with time to become a biased, limited pattern of collective life. Models of culture are exactly what we need.

While I’m skeptical that language models are devoid of meaning, I do share other concerns in “Stochastic Parrots.” For instance, I agree that researchers will need a way to understand the subset of texts that shape a model’s response to a given prompt. Culture is historically specific, so models will never be free of omission and bias. But by the same token, we need to know which practices they represent. 

If companies want to offer language models as a service to the public—say, in web search—they will need to do even more than know what the models represent. Somehow, a single model will need to produce a picture of the world that is acceptable to a wide range of audiences, without amplifying harmful biases or filtering out minority discourses (Bender et al., 614). That’s a delicate balancing act.  

Historians don’t have to compress their material as severely. Since history is notoriously a story of conflict, and our sources were interested participants, few people expect historians to represent all aspects of the past with one correctly balanced model. On the contrary, historical inquiry is usually about comparing perspectives. Machine learning is not the only way to do this, but it can help. For instance, researchers can measure differences of perspective by training multiple models on different publication venues or slices of the timeline.[6]

When research is organized by this sort of comparative purpose, the biases in data are not usually a reason to refrain from modeling—but a reason to create more corpora and train models that reflect a wider range of biases. On the other hand, training a variety of models becomes challenging when each job requires thousands of GPUs. Tech companies might have the resources to train many models at that scale. But will universities? 

There are several ways around this impasse. One is to develop lighter-weight models.[7] Another is to train a single model that can explicitly distinguish multiple perspectives. At present, researchers create this flexibility in a rough and ready way by “fine-tuning” BERT on different samples. A more principled approach might design models to recognize the social structure in their original training data. One recent paper associates each text with a date stamp, for instance, to train models that respond differently to questions about different years.[8] Similar approaches might produce models explicitly conditioned on variables like venue or nationality—models that could associate each statement or prediction they make with a social vantage point.

If neural language models are to play a constructive role in research, universities will also need alternatives to material dependence on tech giants. In 2020, it seemed that only the largest corporations could deploy enough silicon to move this field forward. In October 2021, things are starting to look less dire. Coalitions like EleutherAI are reverse-engineering language models.[9] Smaller corporations like HuggingFace are helping to cover underrepresented languages. NSF is proposing new computing resources.[10] The danger of oligopoly is by no means behind us, but we can at least begin to see how scholars might train models that represent a wider range of perspectives.

Of course, scholars are not the only people who matter. What about the broader risks of language modeling outside universities?

I agree with the authors of “Stochastic Parrots” that neural language models are dangerous. But I am not sure that critical discourse has alerted us to the most important dangers yet. Critics often prefer to say that these models are dangerous only because they don’t work and are devoid of meaning. That may seem to be the strongest rhetorical position (since it concedes nothing to the models), but I suspect this hard line also prevents critics from envisioning what the models might be good for and how they’re likely to be (mis)used.

Consider the surprising art scene that sprang up when CLIP was released. OpenAI still hasn’t released the DALL-E model that translates CLIP’s embeddings of text into images.[11] But that didn’t stop graduate students and interested amateurs from duct-taping CLIP to various generative image models and using the contraption to explore visual culture in dizzying ways. 

“The angel of air. Unreal Engine,” VQGAN + CLIP, Aran Komatsukaki, May 31, 2021.

Will the emergence of this subculture make any sense if we assume that CLIP is just a failed attempt to reproduce individual language use? In practice, the people tinkering with CLIP don’t expect it to respond like a human reader. More to the point, they don’t want it to. They’re fascinated because CLIP uses language differently than a human individual would—mashing together the senses and overtones of words and refracting them into the potential space of internet images like a new kind of synesthesia.[12] The pictures produced are fascinating, but (at least for now) too glitchy to impress most people as art. They’re better understood as postcards from an unmapped latent space.[13] The point of a postcard, after all, is not to be itself impressive, but to evoke features of a larger region that looks fun to explore. Here the “region” is a particular visual culture; artists use CLIP to find combinations of themes and styles that could have occurred within it (although they never quite did).

“The clockwork angel of air, trending on ArtStation,” Diffusion + CLIP, @rivershavewings (Katherine Crowson), September 14, 2021.

Will models of this kind also have negative effects? Absolutely. The common observation that “they could reinforce existing biases” is the mildest possible example. If we approach neural models as machines for mapping and rewiring collective behavior, we will quickly see that they could do much worse: for instance, deepfakes could create new hermetically sealed subcultures and beliefs that are impossible to contest. 

I’m not trying to decide whether neural language models are good or bad in this essay—just trying to clarify what’s being modeled, why people care, and what kinds of (good or bad) effects we might expect. Reaching a comprehensive judgment is likely to take decades. After all, models are easy to distribute. So this was never a problem, like gene splicing, that could stay bottled up as an ethical dilemma for one profession that controlled the tools. Neural models more closely resemble movable type: they will change the way culture is transmitted in many social contexts. Since the consequences of movable type included centuries of religious war in Europe, the analogy is not meant to reassure. I just mean that questions on this scale don’t get resolved quickly or by experts. We are headed for a broadly political debate about antitrust, renewable energy, and the shape of human culture itself—a debate where everyone will have some claim to expertise.[14]

Let me end, however, on a positive note. I have suggested that approaching neural models as models of culture rather than intelligence or individual language use gives us even more reason to worry. But it also gives us more reason to hope. It is not entirely clear what we plan to gain by modeling intelligence, since we already have more than seven billion intelligences on the planet. By contrast, it’s easy to see how exploring spaces of possibility implied by the human past could support a more reflective and more adventurous approach to our future. I can imagine a world where generative models of culture are used grotesquely or locked down as IP for Netflix. But I can also imagine a world where fan communities use them to remix plot tropes and gender norms, making “mass culture” a more self-conscious, various, and participatory phenomenon than the twentieth century usually allowed it to become. 

I don’t know which of those worlds we will build. But either way, I suspect we will need to reframe our conversation about artificial intelligence as a conversation about models of culture and the latent spaces they imply. Philosophers and science fiction writers may enjoy debating whether software can have mental attributes like intention. But that old argument does little to illuminate the social questions new technologies are really raising. Neural language models are dangerous and fascinating because they can illuminate and transform shared patterns of behavior—in other words, aspects of culture. When the problem is redescribed this way, the concerns about equity foregrounded by “Stochastic Parrots” still matter deeply. But the imagined contrast between mimicry and meaning in the article’s title no longer connects with any satirical target. Culture clearly has meaning. But I’m not sure that anyone cares whether a culture has autonomous intent, or whether it is merely parroting human action.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention is All You Need,” 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, 2017. https://arxiv.org/abs/1706.03762 
[2] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Margaret Mitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, March 2021, 610–623. https://doi.org/10.1145/3442188.3445922
[3] Rishi Bommasani et al., “On the Opportunities and Risks of Foundation Models,” CoRR Aug 2021, https://arxiv.org/abs/2108.07258, 3-4.
[4] “[I]t is language which speaks, not the author.” Roland Barthes, “The Death of the Author,” Image / Music / Text, trans. Stephen Heath (New York: Hill and Wang, 1977), 143. 
[5] To this list one might also add the material and social aspects of book production. In commenting on “Stochastic Parrots,” Katherine Bode notes that book history prefers to paint a picture where “meaning is dispersed across…human and non-human agents.” Katherine Bode, qtd. in Lauren M. E. Goodlad, “Data-centrism and its Discontents,” Critical AI, Oct 15, 2021, https://criticalai.org/2021/10/14/blog-recap-stochastic-parrots-ethics-of-data-curation/
[6] Sandeep Soni, Lauren F. Klein, and Jacob Eisenstein, “Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers,” Journal of Cultural Analytics, January 18, 2021, https://culturalanalytics.org/article/18841-abolitionist-networks-modeling-language-change-in-nineteenth-century-activist-newspapers. Ted Underwood, “Machine Learning and Human Perspective,” PMLA 135.1 (Jan 2020): 92-109, http://hdl.handle.net/2142/109140.
[7] I’m writing about “neural language models” rather than “large” ones because I don’t assume that ever-increasing size is a definitional feature of this technology. Strategies to improve efficiency are discussed in Bommasani et al., 97-100.
[8] Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein and William W. Cohen, “Time-Aware Language Models as Temporal Knowledge Bases,” CoRR 2021, https://arxiv.org/abs/2106.15110.
[9] See for instance, Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman, “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow,” March 2021, https://doi.org/10.5281/zenodo.5297715.
<a href="#_ftnref9"[10] White House Briefing Room, “The White House Announces the National Artificial Intelligence Research Resource Task Force,” June 10, 2021, https://www.whitehouse.gov/ostp/news-updates/2021/06/10/the-biden-administration-launches-the-national-artificial-intelligence-research-resource-task-force/
[11] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, “Zero-Shot Text-to-Image Generation,” February 2021, https://arxiv.org/abs/2102.12092.
[12] One good history of this scene is titled “Alien Dreams”—a title that concisely indicates how little interest artists have in using CLIP to reproduce human behavior. Charlie Snell, “Alien Dreams: An Emerging Art Scene,” June 30, 2021, https://ml.berkeley.edu/blog/posts/clip-art/.
[13] For a skeptical history of this spatial metaphor, see Nick Seaver, “Everything Lies in a Space: Cultural Data and Spatial Reality,” Journal of the Royal Anthropological Institute 27 (2021). https://doi.org/10.1111/1467-9655.13479. We also skeptically probe the limits of spatial metaphors for culture (but end up confirming their value) in Ted Underwood and Richard Jean So, “Can We Map Culture?” Journal of Cultural Analytics, June 17, 2021, https://doi.org/10.22148/001c.24911.
[14] I haven’t said much about the energy cost of training models. For one thing, I’m not fully informed about contemporary efforts to keep that cost low. More importantly, I think the cause of carbon reduction is actively harmed by pitting different end users against each other. If we weigh the “carbon footprint” of your research agenda against my conference travel, the winner will almost certainly be British Petroleum. Renewable energy is a wiser thing to argue about if carbon reduction is actually our goal. Mark Kaufman, “The carbon footprint sham: a ‘successful, deceptive’ PR campaign,” Mashable, July 9, 2021, https://mashable.com/feature/carbon-footprint-pr-campaign-sham.

Categories
machine learning transformer models

Science fiction hasn’t prepared us to imagine machine learning.

Science fiction did a great job preparing us for submarines and rockets. But it seems to be struggling lately. We don’t know what to hope for, what to fear, or what genre we’re even in.

Space opera? Seems unlikely. And now that we’ve made it to 2021, the threat of zombie apocalypse is receding a bit. So it’s probably some kind of cyberpunk. But there are many kinds of cyberpunk. Should we get ready to fight AI or to rescue replicants from a sinister corporation? It hasn’t been obvious. I’m writing this, however, because recent twists in the plot seem to clear up certain mysteries, and I think it’s now possible to guess which subgenre the 2020s are steering toward.

Clearly some plot twist involving machine learning is underway. It’s been hard to keep up with new developments: from BERT (2018) to GPT-3 (2020)—which can turn a prompt into an imaginary news story—to, most recently, CLIP and DALL-E (2021), which can translate verbal descriptions into images.

Output from DALL-E. If you prefer, you can have a baby daikon radish in a tutu walking a dog.

I have limited access to DALL-E, and can’t test it in any detail. But if we trust the images released by Open AI, the model is good at fusing and extrapolating abstractions: it not only knows what it means for a lemur to hold an umbrella, but can produce a surprisingly plausible “photo of a television from the 1910s.” All of this is impressive for a research direction that isn’t much more than four years old.

The prompt here is “a photo of a television from the …<fill in the decade>”

On the other hand, some AI researchers don’t believe these models are taking the field in the direction it was supposed to go. Gary Marcus and Ernest Davies, for instance, doubt that GPT-3 is “an important step toward artificial general intelligence—the kind that would … reason broadly in a manner similar to humans … [GPT-3] learns correlations between words, and nothing more.”

People who want to contest that claim can certainly find evidence on the other side of the question. I’m not interested in pursuing the argument here. I just want to know why recent advances in deep learning give me a shivery sense that I’ve crossed over into an unfamiliar genre. So let’s approach the question from the other side: what if these models are significant because they don’t reason “in a manner similar to humans”?

It is true, after all, that models like DALL-E and GPT-3 are only learning (complex, general) patterns of association between symbols. When GPT-3 generates a sentence, it is not expressing an intention or an opinion—just making an inference about the probability of one sentence in a vast “latent space” of possible sentences implied by its training data.

When I say “a vast latent space,” I mean really vast. This space includes, for instance, the thoughts Jerome K. Jerome might have expressed about Twitter if he had lived in our century.

Mario Klingemann gets GPT-3 to extrapolate from a title and a byline.

But a latent space, however vast, is still quite different from goal-driven problem solving. In a sense the chimpanzee below is doing something more like human reasoning than a language model can.

Primates, understandably, envision models of the world as things individuals create in order to reach bananas. (Ultimately from Wolfgang Köhler, The Mentality of Apes, 1925.)

Like us, the chimpanzee has desires and goals, and can make plans to achieve them. A language model does none of that by itself—which is probably why language models are impressive at the paragraph scale but tend to wander if you let them run for pages.

So where does that leave us? We could shrug off the buzz about deep learning, say “it’s not even as smart as a chimpanzee yet,” and relax because we’re presumably still living in a realist novel.

And yes, to be sure, deep learning is in its infancy and will be improved by modeling larger-scale patterns. On the other hand, it would be foolish to ignore early clues about what it’s good for. There is something bizarrely parochial about a view of mental life that makes predicting a nineteenth-century writer’s thoughts about Twitter less interesting than stacking boxes to reach bananas. Perhaps it’s a mistake to assume that advances in machine learning are only interesting when they resemble our own (supposedly “general”) intelligence. What if intelligence itself is overrated?

The collective symbolic system we call “culture,” for instance, coordinates human endeavors without being itself intelligent. What if models of the world (including models of language and culture) are important in their own right—and needn’t be understood as attempts to reproduce the problem-solving behavior of individual primates? After all, people are already very good at having desires and making plans. We don’t especially need a system that will do those things for us. But we’re not great at imagining the latent space of (say) all protein structures that can be created by folding amino acids. We could use a collaborator there.

Storytelling seems to be another place where human beings sense a vast space of latent possibility, and tend to welcome collaborators with maps. Look at what’s happening to interactive fiction on sites like AI Dungeon. Tens of thousands of users are already making up stories interactively with GPT-3. There’s a subreddit devoted to the phenomenon. Competitors are starting to enter the field. One startup, Hidden Door, is trying to use machine learning to create a safe social storytelling space for children. For a summary of what collaborative play can build, we could do worse than their motto: “Worlds with Friends.”

It’s not hard to see how the “social play” model proposed by Hidden Door could eventually support the form of storytelling that grown-ups call fan fiction. Characters or settings developed by one author might be borrowed by others. Add something like DALL-E, and writers could produce illustrations for their story in a variety of styles—from Arthur Rackham to graphic novel.

Will a language model ever be as good as a human author? Can it ever be genuinely original? I don’t know, and I suspect those are the wrong questions. Storytelling has never been a solitary activity undertaken by geniuses who invent everything from scratch. From its origin in folk tales, fiction has been a game that works by rearranging familiar moves, and riffing on established expectations. Machine learning is only going to make the process more interactive, by increasing the number of people (and other agents) involved in creating and exploring fictional worlds. The point will not be to replace human authors, but to make the universe of stories bigger and more interconnected.

Storytelling and protein folding are two early examples of domains where models will matter not because they’re “intelligent,” but because they allow us—their creators—to collaboratively explore a latent space of possibility. But I will be surprised if these are the only two places where that pattern emerges. Music and art, and other kinds of science, are probably open to the same kind of exploration.

This collaborative future could be weirder than either science fiction or journalism have taught us to expect. News stories about ML invariably invite readers to imagine autonomous agents analogous to robots: either helpful servants or inscrutable antagonists like the Terminator and HAL. Boring paternal condescension or boring dread are the only reactions that seem possible within this script.

We need to be considering a wider range of emotions. Maybe a few decades from now, autonomous AI will be a reality and we’ll have to worry whether it’s servile or inscrutable. Maybe? But that’s not the genre we’re in at the moment. Machine learning is already transforming our world, but the things that should excite and terrify us about the next decade are not even loosely analogous to robots. We should be thinking instead about J. L. Borges’ Library of Babel—a vast labyrinth containing an infinite number of books no eye has ever read. There are whole alternate worlds on those shelves, but the Library is not a robot, an alien, or a god. It is just an extrapolation of human culture.

Eric Desmazieres, “The Library of Babel.”

Machine learning is going to be, let’s say, a thread leading us through this Library—or perhaps a door that can take us to any bookshelf we imagine. So if the 2020s are a subgenre of SF, I would personally predict a mashup of cyberpunk and portal fantasy. With sinister corporations, of course. But also more wardrobes, hidden doors, encylopedias of Tlön, etc., than we’ve been led to expect in futuristic fiction.

I’m not saying this will be a good thing! Human culture itself is not always a good thing, and extrapolating it can take you places you don’t want to go. For instance, movements like QAnon make clear that human beings are only too eager to invent parallel worlds. Armored with endlessly creative deepfakes, those worlds might become almost impenetrable. So we’re probably right to fear the next decade. But let’s point our fears in a useful direction, because we have more interesting things to worry about than a servant who refuses to “open the pod bay doors.” We are about to be in a Borges story, or maybe, optimistically, the sort of portal fantasy where heroines create doors with a piece of chalk and a few well-chosen words. I have no idea how our version of that story ends, but I would put a lot of money on “not boring.”

Categories
fiction plot

How predictable is fiction?

This blog post is loosely connected to a talk I’m giving (virtually) at the Workshop on Narrative Understanding, Storylines, and Events at the ACL. It’s an informal talk, exploring some of the challenges and opportunities we encounter when we take the impressive sentence-level tools of contemporary NLP and try to use them to produce insights about book-length documents.

Questions about the “predictability” of fiction started to interest me after I read a preprint by Maarten Sap et al. on the difference between “recollected” and “imagined” stories. There’s a lot in the paper, but the thing that especially caught my eye was that a neural language model (GPT) does better predicting the next sentence in imagined stories than in recollected stories about biographical events. The authors persuasively interpret this as a sign that imagined stories have been streamlined by a process of “narrativization.”

The stories in that article are very short narratives made up (or recalled) by experimental subjects. But, given my background in literary history, I wondered whether the same contrast might appear between book-length works of fiction and biography. Are fictional narratives in some sense more predictable than nonfiction?

One could say we already know the answer. Fiction is governed by plot conventions, so of course it makes sense that it’s predictable! But an equally intuitive argument could be made that fiction entertains readers by baffling and eluding their expectations about what, specifically, will happen next. Perhaps it ought to be less predictable than nonfiction? In short, there are basic questions about fiction that don’t have clear general answers yet, although we’re getting better at framing the questions. (See e.g. Caroline Levine on The Serious Pleasures of Suspense, Vera Tobin on Elements of Surprise, or Andrew Piper’s chapters on “Plot” and “Fictionality” in Enumerations.)

Plus, even if it were intuitively obvious that fiction is more strongly governed by plot conventions than by surprise, it might be interesting to measure the strength of those conventions in particular works. If we could do that, we’d have new evidence for a host of familiar debates about tradition and innovation.

So, how to do it? Sap et al. measure “narrative flow” by using a neural language model that can judge whether a sentence is likely to occur in a given context. It’s a good strategy for paragraph or page-sized stories, but I suspect sentences may be too small to capture the things we would call “predictable plot patterns” in novels. However, it wasn’t hard to give this strategy a spin, so I did, using a language model called BERT to assess pairs of sentences from 32 biographies and 32 novels. (This is just a toy-sized sample for a semi-thought-experiment; I’m not pretending to finally resolve anything.) At each step, in each book, I asked BERT to judge the probability that sentence B would really follow sentence A. (The code I used is in a GitHub repo.)

The result I got was the opposite of the one reported in Sap et al. There is a statistically significant difference between biography and fiction, but the pairs of sentences in biography appeared more predictable—more likely to follow each other—than the sentences in fiction. I hasten to say, however, that this could be wrong in several ways. First, BERT’s perception that two sentences are likely to follow each other correlates strongly with the length of the sentences. Short sentences (like most sentences in dialogue) seem less clearly connected. Since there’s a lot of dialogue in published fiction, BERT might be, in effect, biased against fiction.

Fig. 1. Two different ways of measuring continuity between some sample sentences.

More importantly, sentence-level continuity isn’t necessarily a good measure of surprise in novel-length works. For instance, in fig. 1, you’ll notice that BERT is unruffled when Pride and Prejudice morphs into Flatland. As long as each sentence picks up some discursive cue from the one before, BERT perceives the pairs as plausibly connected. But by the fourth sentence in the chain, Mr Bennet is listening to a lecture from a translucent, blue, four-dimensional being in his sitting room. Human readers would probably be surprised if this happened.

There are ways to generate “sentence embeddings” that might correspond more closely to human surprise. (This is a crowded field, but see for instance Sentence-BERT, Reimers and Gurevych 2019.) Even primitive 2014-era GloVe embeddings do a somewhat better job (Pennington, Socher, and Manning 2014). By averaging the GloVe embeddings for all the words in a sentence, we can represent each sentence as a vector of length 300. Then we can measure the cosine distances between sentences, as I’ve done in the third column of Fig 1. (Here, large numbers indicate a big gap between sentences; it’s the reverse of the “probability” measure provided by BERT, where high numbers represent continuity.) This model of distance is (appropriately) more surprised by the humming blue sphere in row three than by the short sentence of dialogue in row five.

But even if we had a good measure of continuity, sentences might just be too small to capture the patterns that count as “predictability” in a novel. As the example in fig. 1 suggests, a sequence of short steps, individually unsurprising, can leave the reader in a world very different from the place they started. Continuity of this kind is not the “predictability” we would want to measure at book scale.

When readers talk about predictable or unpredictable stories, they’re probably thinking about specific problem situations and possible outcomes. Will the protagonist marry suitor A or suitor B? Can we guess? It may soon be possible to automatically extract implicit questions of this kind from fiction. And the Story Cloze task (Mostafazadeh et al.) showed that it’s possible to answer “what happens next” at paragraph scale. But right now I don’t know how to extract implicit questions, or answer them, at the scale of a novel. So let’s try a simpler—in fact minimal— predictive task. Given two passages selected at random from a book, can we predict which came first? Doing that won’t tell us anything about plot—if “plot” is a causal connection between events. But it will tell us whether book-length works are organized by any predictable large-scale patterns. (As we’ll see in a moment, this is a real question, and in some genres the answer might be “not really.”)

The vector-space representation we developed in the third column of Fig. 1 can be scaled up for this question. “Paragraphs” and “chapters” mean different things in different periods, so for now, it may be better simply to divide stories into arbitrary thousand-word passages. Each passage will be represented as a vector by averaging the GloVe embeddings for the words in it; we’ll subtract one passage from the other and use the difference to decide whether A came before B in the book, or vice-versa.

Fig. 2. Accuracy of sequence prediction for randomly selected pairs of passages from detective novels, or novels randomly selected from the whole Chicago Novel Corpus. Regularized logistic regression is trained on 47 volumes and tested on the 48th; the boxplots represent the range of mean accuracies for different held-out volumes.

Random accuracy for this task would be 50%, but a model trained on a reasonable number of novels can easily achieve 65-66%, especially if the novels are all in the same genre. That number may not sound impressive, but I suspect it’s not much worse than human accuracy would be—if a human reader were asked to draw the arrow of time connecting two random passages from an unfamiliar book.

In fact, why is it possible to do this at all? Since the two passages may be separated by a hundred-odd pages, our model clearly isn’t registering any logical relationships between events. Instead, it’s probably relying on patterns described in previous work by David McClure and Scott Enderle. McClure and Enderle have shown that there are strong linguistic gradients across narrative time in fiction. References to witnesses, guilt, and jail, for instance, tend to occur toward the end of a book (if they occur at all).

Fig. 3. David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Our model may draw even stronger clues from simple shifts of rhetorical perspective like the one in figure 3: indefinite articles appear early in a book, when “a mysterious old man” enters “a room.” A few pages later, he will either acquire a name or become “the old man” in “the room.”

Fig. 4. David William McClure and Scott Enderle, “Distribution of Function Words Across Narrative Time in 50,000 Novels,” ADHO 2018.

We probably wouldn’t call that shift of perspective “plot.” On the other hand, before we dismiss these gradients as merely linguistic rather than narrative phenomena, it’s worth noting that they seem to be specific to fiction. When I try to use the same general strategy to predict the direction of time between pairs of passages in biographies, the model struggles to do better than random guessing. Even with the small toy sample I’m using below (32 novels and 32 biographies), there is clearly a significant difference between the two genres. So, although BERT may not see it, fictional narratives are more predictable than nonfiction ones when we back out to look at the gradient of time across a whole book. There is a much clearer difference between before and after in fiction.

Fig. 5. Range of accuracies for a regularized logistic regression model trained to identify the earlier of two 1000-word passages.

“A predictable difference between before and after” is something a good bit cruder than we ordinarily mean by “plot.” But the fact that this difference is specific to fiction makes me think that a model of this kind may after all confirm some part of what we meant in speculating “fictional plots are shaped by conventions that make them more predictable than nonfiction.”

Of course, to really understand plot, we will need to pair these loose book-sized arcs with a more detailed understanding of the way characters’ actions are connected as we move from one page to the next. For that kind of work, I invite you to survey the actual papers accepted for the Workshop on Narrative Understanding <gestures at the program>, which are advancing the state of the art, for instance, on event extraction.

But I can’t resist pointing out that even the crude vector-space model I have played with here can give us some leverage on page-level surprise, and in doing so, complicate the story I’ve just told. One odd detail I’ve noticed is that the predictability of a narrative at book scale (measured as our ability to predict the direction of time between two widely separated passages) correlates with a kind of unpredictability as we move from one sentence, page, or thousand-word passage to the next.

For instance, one way to describe the stability of a sequence is to measure “autocorrelation.” If we shift a time series relative to itself, moving it back by one step, how much does the original series correlate with the lagged version?

Fig 6. These are wholly imaginary curves to illustrate an idea.

A process with a lot of inertia (e.g., change in temperature across a year) might still have the same basic shape if we shift it backward eight hours. The amount of sunlight in Seattle, on the other hand, fluctuates daily and will be largely out of phase with itself if we shift it backward eight hours; the correlation between those two curves will be pretty low, or even negative as above.

Since we’re representing each passage of a book as a vector of 300 numbers, this gives us 300 time series—300 curves—for each volume. It is difficult to say what each curve represents; the individual components of a word embedding don’t come with interpretable labels. But we can measure the narrative’s general degree of inertia by asking how strongly these curves are, collectively, autocorrelated. Crudely: I shift each time series back one step (1000 words) and measure the Pearson correlation coefficient between the lagged and unlagged version. Then I take the mean correlation for all 300 series.*

Fig 7. Relationship between the volatility of the text (low autocorrelation) and accuracy of models that attempt to put two passages in the right order. Although there are more fiction volumes, we keep accuracy comparable by training on only 32 volumes at a time.

The result is unintuitive. You might think it would be easier to predict the direction of narrative time in books where variables change slowly—as temperature does—tracing a reliable arc. But instead it turns out that prediction is more accurate in books where these curves behave a bit like sunlight, fluctuating substantially every 1000 words. (The linear relationship with autocorrelation is r = -.237 in fig 7, though I suspect the real relationship isn’t linear.) Also, biography appears to be distinguished from fiction by higher autocorrelation (lower volatility).

So yes, fiction is more predictable than nonfiction across the sweep of a whole narrative (because the beginnings and ends of novels are rhetorically very distinct). But the same observation doesn’t necessarily hold as we move from page to page, or sentence to sentence. At that scale, fiction may be more volatile than nonfiction is. I don’t yet know why! We could speculate that this has something to do with an imperative to surprise the reader—but it might also be as simple as the alternation of dialogue and description, which creates a lot of rapid change in the verbal texture of fiction. In short, I’m pointing to a question rather than answering one. There appear to be several different kinds of “predictability” in narrative, and teasing them apart might give us some simple insights into the structural differences between fiction and nonfiction.

  • Postscript: Everything above is speculative and exploratory. I’ve shared some code and data in a repository, but I wouldn’t call it fully replicable. There are more sophisticated ways to measure autocorrelation. If any economists read this, it will occur to them that we could also “predict the future course of a story” using full vector autoregression or an ARIMA model. I’ve tried that, but my sense is that the results were actually dominated by the two factors explored separately above (before-and-after predictability and the autocorrelation of individual variables with themselves). Also, to make any of this really illuminate literary history, we will need a bigger and better corpus, allowing us to ask how patterns like this intersect with genre, prestige, and historical change. A group of researchers at Illinois, including Wenyi Shang and Peizhen Wu, are currently pursuing those questions.

References:

Edwin A. Abbott, Flatland: A Romance of Many Dimensions (London: 1884).

Sanjeev Arora, Yingyu Liang, Tengyu Ma, “A Simple but Tough-to-Beat Baseline for Sentence Embeddings.” ICLR 2017.

Austen, Jane. Pride and Prejudice. London: Egerton, 1813.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

Caroline Levine, The Serious Pleasures of Suspense (Charlottesville, University of Virginia Press, 2003).

David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, James Allen. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. NAACL 2016.

Shay Palachy, “Document Embedding Techniques: A Review of Notable Literature on the Topic,” Towards Data Science, September 9, 2019.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. EMNLP 2014.

Andrew Piper, Enumerations (Chicago: University of Chicago Press, 2018).

Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” EMNLP-IJCNLP 2019.

Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, James Pennebaker. Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models. ACL 2020.

Vera Tobin, Elements of Surprise: Our Mental Limits and the Satisfactions of Plot (Cambridge: Harvard University Press, 2018).