Categories
19c 20c cultural analytics deep learning fiction plot

Can language models predict the next twist in a story?

While distant reading has taught us a lot about the history of fiction, it hasn’t done much yet to explain why we keep turning pages.

“Suspense” is the word we use to explain that impulse. But what is suspense? Does it require actual anxiety, or just uncertainty about what happens next? If suspense depends on not knowing what will happen, how can we enjoy re-reading familiar books? (Smuts 2009) And why do we enjoy being surprised? (Tobin 2018)

Beyond these big theoretical puzzles, there are historical questions scholars might like to ask about the way authors use chapter breaks to structure narrative revelation (Dames 2023, 219-38).

Right now, distant reading can’t fully answer any of these questions. When we want to measure surprise or novelty, for instance, we typically measure change in the verbal texture of a story from beginning to end. I made a coarse attempt of that kind in a blog post a few years ago. Other articles use better methods, and give us new ways to think about form (McGrath et al. 2018, Piper et al. 2023). But how closely does the pace of verbal change correlate with readers’ experience of uncertainty or surprise? We don’t know.

Autoregressive language models offer a tempting new angle on this problem, because they’re trained specifically to predict how a given text will continue. Intuitively, it feels like we could measure the predictability of a plot by first asking a model to continue the story, and then measuring the divergence between predicted continuation and real text. Even if this isn’t exactly how readers form expectations and experience surprise, it might begin to give us some leverage on the question.

Researchers have run a loosely similar experiment on very short stories contributed by experimental subjects (Sap et al. 2020). But scaling that up to published novels poses a challenge. For one thing, language models may not be equally good at imitating every style. A contemporary model’s failure to predict the next sentence by Jane Austen might just mean that it’s bad at channeling the Regency.

So, to factor style out of the question, let’s ask a model to predict what will happen in, say, the next three pages of a story — and then compare those predictions to its own summaries of the pages when it sees them.

Readers of a certain age will recognize this as a game Ernie invites Bert to play on “Sesame Street.”

Ernie asks Bert “what happens next” in this picture. Bert anticipates that the man will step in the pail, and disaster will ensue.

To spell the method out more precisely: we move through a novel roughly 900 words at a time. On each pass, we give a language model both a recap of earlier events, and a new 900-word passage. We ask the model to summarize the new passage, and also ask it to predict what will happen next. Then we compare its prediction to the summary it generates when it actually sees the next passage, and measure cosine distance between the two sentence embeddings. A large distance means the model did a poor job of predicting “what would happen next.”
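In code, one pass through a novel looks roughly like the sketch below. This is a minimal sketch, assuming the pre-1.0 openai library and the sentence-transformers package for embeddings; the prompts and helper names are illustrative rather than the exact code used for this post (which is linked at the end).

import numpy as np
import openai
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ask(prompt):
    """One single-turn request to the chat model."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def divergences(chunks):
    """Cosine distance between each prediction and the next summary.

    chunks: consecutive ~900-word passages from one novel, in order."""
    recap, last_prediction, scores = "", None, []
    for passage in chunks:
        summary = ask(f"Recap so far: {recap}\n\nNew passage: {passage}\n\n"
                      "Summarize what happens in the new passage.")
        prediction = ask(f"Recap so far: {recap}\n\nNew passage: {passage}\n\n"
                         "Predict what will happen in the next three pages.")
        if last_prediction is not None:
            vectors = embedder.encode([last_prediction, summary])
            scores.append(cosine_distance(vectors[0], vectors[1]))
        recap = recap + " " + summary
        last_prediction = prediction
    return scores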

Does this have any relation to human uncertainty?

I’m not claiming that this is a good model of the way readers experience plot. We don’t have a good model of that yet! The more appropriate question to ask is: Does this correlate at all with anything human readers do?

We can check by asking a reader to do the same thing: read roughly 900-word passages and make predictions about the future. Then we can compare the human reader’s predictions to automatically-generated summaries.

Passages were drawn from Now in November, by Josephine Johnson, and Murder is Dangerous, by Saul Levinson. n = 51 passages, Pearson’s r = .41, p < .01. Human predictions are more variable in quality than the model’s.

When I did this for two novels that were complete blanks to me, my predictions tended to diverge from the actual course of the story in roughly the same places where the model found prediction difficult. So there does seem to be some relationship between a language model’s (in)ability to see what’s coming and a human reader’s.

The image above also reveals that there are broad, consistent differences between books. For both people and models, some stories are easier to predict than others.

A reason not to trust this

Readers of this story may already anticipate the next twist—which is that of course we shouldn’t use LLMs to study uncertainty, because these models have already read many of the books we’re studying and will (presumably) already know the plot.

This is a particularly nasty problem because we don’t have a list of the books commercial models were trained on. We’re flying blind. But before we give up, let’s test how much of a problem this really poses. Researchers at Berkeley have defined a convenient test of the extent to which a model has memorized a book (Chang et al. 2023). In essence, they ask the model to fill in missing names.
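In outline, the test looks something like the sketch below; the prompt here is paraphrased from Chang et al. rather than quoted verbatim. Because a masked proper name is usually unguessable from a short passage alone, accuracy serves as a proxy for memorization.

import openai

def name_cloze(masked_passage):
    """masked_passage contains one [MASK] where a proper name was."""
    prompt = ("You have seen the following passage in your training data. "
              "Fill in the [MASK] with the proper name that belongs there. "
              "Reply with the name and nothing else.\n\n" + masked_passage)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def cloze_accuracy(examples):
    """examples: (masked_passage, true_name) pairs from one book."""
    hits = sum(name_cloze(masked).lower() == name.lower()
               for masked, name in examples)
    return hits / len(examples)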

Running this test, Chang et al. find that GPT-4 remembers many books in detail. Moreover, its ability to fill in masked names correlates with its accuracy on certain other tasks—like its ability to estimate date of publication. This could be a problem for questions about plot.

To avoid this problem (and also save money), I’ve been using GPT-3.5, which Chang et al. find is less prone to memorize books. But is that enough to address the problem? Let’s check. Below I’ve plotted the average divergence between prediction and summary for 25 novels on the y axis, and GPT-3.5’s ability to supply masked names in those texts on the x axis. If memorization were making prediction more accurate, we would expect to see a negative correlation: predictions’ divergence from summaries should go down as name_cloze accuracy goes up.

The y axis is average cosine distance between prediction and summary; x axis is GPT-3.5’s accuracy on the name cloze test defined in Chang et al.

25 books is not enough for a conclusive answer, but so far I cannot measure any pattern of that kind. (If anything, there is a faint trend in the opposite direction.)
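For anyone who wants to rerun that check, it amounts to a scatter plot and a correlation coefficient. A sketch, assuming one averaged value per book:

import matplotlib.pyplot as plt
from scipy.stats import pearsonr

def memorization_check(cloze_accuracy, mean_divergence):
    """Both arguments are lists with one value per book."""
    r, p = pearsonr(cloze_accuracy, mean_divergence)
    plt.scatter(cloze_accuracy, mean_divergence)
    plt.xlabel("name cloze accuracy (GPT-3.5)")
    plt.ylabel("mean cosine distance, prediction vs. summary")
    plt.title(f"r = {r:.2f}, p = {p:.2f}")
    plt.show()
    return r, p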

In an ideal world, researchers would use language models trained on open data sets that they know and control. But until we get to an ideal world, it looks like it may be possible to run proof-of-concept experiments with things like GPT-3.5, at least if we avoid extremely famous books.

Scrutinizing the image above, readers will probably notice that the most predictable book in this sample was Zoya, by Danielle Steel. Although Steel has a reputation that may encourage disparaging inferences — see Dan Sinykin, Big Fiction, for why — I don’t think we’re in a position to draw those inferences yet. The local rhythms that make prediction possible across three pages are not necessarily what critics mean when they use “predictable” to diss a book.

So what could we learn from predicting the next three pages?

To consider one possible payoff: it might give us a handle on the way chapter-breaks, and other divides, structure the epistemic rhythms of fiction. For instance, many readers have noticed that the installments of novels originally published in magazines tend to end with an explicit mystery to ensure that you keep reading (Haugtvedt 2016 and Beekman 2017). In the first installment of Arthur Conan Doyle’s Hound of the Baskervilles (which covers two chapters), Watson and Holmes learn about a legendary curse that connects the family of the Baskervilles to a fiendish hound. In the final lines of the first installment, Holmes asks the family doctor about footprints found near the body of Sir Charles Baskerville. “A man’s or a woman’s?” Holmes asks. The doctor’s “voice sank almost to a whisper as he answered. ‘Mr Holmes, they were the footprints of a gigantic hound!'” End installment.

Novels don’t have soundtracks. But unexplained, suggestive new information is as good as a sting: “Bum – bum – BUM!”

The Hound of the Baskervilles, by Sidney Paget, 1902.

It appears that we can measure this cliffhanger effect: the serial installments of The Hound of the Baskervilles often end with a moment of heightened mystery — at least, if inability to predict the next three pages is any measure of mystery. When we measure predictive accuracy throughout the story, making four different passes to ensure we have roughly-900-word chunks aligned with all the chapter breaks, we find that predictions are farther from reality at the ends of serial installments. There is no similar effect at other chapter breaks.

The mean for breaks at serial installments is more than one standard deviation above the mean for other chapter-breaks. In spite of tiny n, this is actually p < .05.
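With so few breaks, a permutation test is a reasonable way to put a p-value on that difference. Here is a sketch (not necessarily the exact test behind the figure above): divergence scores at installment-final breaks go in one list, scores at other chapter breaks in another.

import random

def permutation_test(serial_scores, other_scores, trials=10000):
    """One-sided p-value for the observed difference in means."""
    def mean(xs):
        return sum(xs) / len(xs)
    observed = mean(serial_scores) - mean(other_scores)
    pooled = list(serial_scores) + list(other_scores)
    k, hits = len(serial_scores), 0
    for _ in range(trials):
        random.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return hits / trials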

Now, this is admittedly a cherry-picked example. So far, I have only looked at seven novels where we can distinguish the ends of serial installments from other kinds of chapter break (using data from Warhol et al.). And I don’t see this pattern in all of them.

So I’m not yet making any historical argument about serialization and the rise of the cliffhanger. I’m just suggesting that it’s the kind of question someone could eventually address using this method. A doctoral student could do it, for instance, with a locally hosted model. (I don’t recommend doing it with GPT-3.5, because I dropped $150 or so on this post, and that might add up across a dissertation.) Some initial tests suggest to me that this approach will produce results significantly different than we’re getting with lexical methods.

Since I’m explicitly encouraging people to run with this, let me also say that someone actually writing a paper using this method might want to tinker with several things before trusting it:

  1. Measuring the distance between the embedding of one prediction sentence and one summary sentence is a crude way to measure expectation and surprise. Readers don’t necessarily form a single expectation about plot. Maybe it would be better to model expectation as a range of possibility? (See the sketch after this list.)
  2. Related to this: models may need to be nudged to speculate and not just predict that current actions will continue.
  3. 900-word chunks may not be the only appropriate scale of analysis. When readers talk about narrative surprise they’re often thinking about larger arcs like “who will he marry?” or “who turns out to be the murderer?”
  4. We need a way to handle braided narratives where each chapter is devoted to a different group of characters (Garrett 1980). In a multi-plot story, the B or C plot will often not continue across a chapter break.
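On the first of those points, here is a minimal sketch of what a “range of possibility” could look like: sample several predictions for the same passage and score surprise as the distance to the nearest one. (It reuses the embedder and cosine_distance helpers from the sketch earlier in this post.)

def surprise_over_range(predictions, summary):
    """predictions: several sampled continuations of the same passage."""
    vectors = embedder.encode(predictions + [summary])
    summary_vector = vectors[-1]
    return min(cosine_distance(v, summary_vector) for v in vectors[:-1])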

But we’re in a multi-plot narrative ourselves, so those problems may be solved by a different group of characters. This was just a blog post to share an idea and get people arguing about it. Tune in next time, for our thrilling conclusion. (Bum – bum – BUM!)

Code and data used for this post are available on Github.

The ideas discussed here were previously presented in Paris at a workshop on AI for the analysis of literary corpora, and in Copenhagen in a conference on generative methods in the social sciences and humanities. I’d like to thank the organizers of those events, especially Thierry Poibeau, Anders Munk, and Rolf Lund, for stimulating conversation — and also many people in attendance, especially David Bamman, Lynn Cherny, and Meredith Martin. In writing code to query the OpenAI API, I borrowed snippets from Quinn Dombrowski (and also of course from GPT-4 itself oh brave new world &c). My thinking about 19c serialization was advanced by conversation with David Bishop and Eleanor Courtemanche, and by suggestions from Ryan Cordell and Elizabeth Foxwell on Bluesky.

References

Beekman, G. (2017) “Emotional Density, Suspense, and the Serialization of The Woman in White in All the Year Round.” Victorian Periodicals Review 50.1.

Chang, K., Cramer, M., Soni, S., & Bamman, D. (2023) “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,” https://arxiv.org/abs/2305.00118.

Dames, N. (2023) The Chapter: A Segmented History from Antiquity to the Twenty-First Century. Princeton University Press.

Garrett, P. (1980) The Victorian Multiplot Novel: Studies in Dialogical Form. Yale University Press.

Haugtvedt, E. (2016) “The Sympathy of Suspense: Gaskell and Braddon’s Slow and Fast Sensation Fiction in Family Magazines.” Victorian Periodicals Review 49.1.

McGrath, L., Higgins, D., & Hintze, A. (2018) “Measuring Modernist Novelty.” Journal of Cultural Analytics. https://culturalanalytics.org/article/11030-measuring-modernist-novelty

Piper, A., Xu, H., & Kolaczyk, E. D. (2023) “Modeling Narrative Revelation.” Computational Humanities Research 2023. https://ceur-ws.org/Vol-3558/paper6166.pdf

Sap, M., Horvitz, E., Choi, Y., Smith, N. A., & Pennebaker, J. (2020) “Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models.” Proceedings of the 58th Annual Meeting of the ACL. https://aclanthology.org/2020.acl-main.178/

Sinykin, D. (2023) Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. Columbia University Press.

Smuts, A. (2009) “The Paradox of Suspense.” Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/paradox-suspense/#SusSur

Tobin, V. (2018) The Elements of Surprise: Our Mental Limits and the Satisfactions of Plot. Harvard University Press.

Warhol, R., et al. “Reading Like a Victorian,” https://readinglikeavictorian.osu.edu/.

Categories
deep learning social effects of machine learning

Liberally-educated students need to be more than consumers of AI

The initial wave of controversy over large language models in education is dying down. We haven’t reached consensus about what to do yet. (Retreat to in-class exams? make tougher writing assignments? just forbid the use of AI?) But it’s clear to everyone now that the models will require some response.

In a month or so there will be a backlash to this debate, as professors discover that today’s models are still not capable of writing a coherent, footnoted, twelve-page research paper on their own. We may tell ourselves that the threat to education was overhyped, and congratulate ourselves on having addressed it.

That will be a mistake. We haven’t even begun to discuss the challenge AI poses for education.

For professors, yes, the initial disruption will be easy to address: we can find strategies that allow us to continue evaluating student work while teaching the courses we’re accustomed to teach. Problem solved. But the challenge that really matters here is a challenge for students, who will graduate into a world where white-collar work is being redefined. Some things will get easier: we may all have assistants to help us handle email. But by the same token, students will be asked to tackle bigger challenges.

Our vision of those challenges is confined right now by a discourse that treats models as paper-writing machines. But that’s hardly the limit of their capacity. For instance, models can read. So a lawyer in 2033 may be asked to “use a model to do a quick scan of new case law in these thirty jurisdictions and report back tomorrow on the implications for our project.” But then, come to think of it, a report is bulky and static. So you know what, “don’t write a report. What I really need is a model that’s prepared to provide an overview and then answer questions on this topic as they emerge in meetings over the next week.”

A decade from now, in short, we will probably be using AI not just to gather material and analyze it, but to communicate interactively with customers and colleagues. All the forms of critical thinking we currently teach will still have value in that world. It will still be necessary to ask questions about social context, about hidden assumptions, and about the uncertainty surrounding any estimate. But our students won’t be prepared to address those questions unless they also know enough about machine learning to reason about a model’s assumptions and uncertainty. At higher levels of responsibility, this will require more than being a clever prompter and savvy consumer. White-collar professionals are likely to be fine-tuning their own models; they will need to choose a base model, assess training strategies, and decide whether their models are over-confident or over-cautious.

Midjourney: “a college student looking at helpful utility robots in a department store window, HD photography, 80 mm lens --ar 17:9”

The core problem here is that we can’t fully outsource thinking itself. I can trust Toyota to build me a good car, although I don’t know how fuel injection works. Maybe I read Consumer Reports and buy what they recommend? But if I’m buying a thinking process, I really need to understand what I see when I look under the hood. Otherwise “critical thinking” loses all meaning.

Academic conversation so far has not seemed to recognize this challenge. We have focused on preserving existing assignments, when we should be talking about the new courses and new assignments students will need to think critically in the 2030s.

Because AI has been framed as a collection of opaque gadgets, I know this advice will frustrate many readers. “How can we be expected to prepare students for the 2030s? It’s impossible to know the technical details of the tools that will be available. Besides, no one understands how models work. They’re black boxes. The best preparation we can provide is a general attitude of caveat emptor.”

This is an understandable response, because things moved too quickly in the past four years. We were still struggling to absorb the basic principles of statistical machine learning when statistical ML was displaced by a new generation of tools that seemed even more mysterious. Journalists more or less gave up on explaining things.

But there are principles that undergird machine learning. Statistical learning is really, at bottom, a theory of learning: it tries to describe mathematically what it means to generalize from examples. The concept of a “bias-variance tradeoff,” for instance, allows us to reason more precisely about the intuitive insight that there is some optimal level of abstraction for our models of the world.

Illustration borrowed from https://upscfever.com/upsc-fever/en/data/deeplearning2/2.html.

Deep learning admittedly makes things more complex than the illustration above implies. (In fact, understanding the nature of the generalization performed by LLMs is still an exciting challenge — see the first few minutes of this recent talk by Ilya Sutskever for an example of reflection on the topic.) But if students are going to be fine-tuning their own models, they will definitely need at least a foundation in concepts like “variance” and “overfitting.” A basic course on statistical learning should be part of the core curriculum.
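To make that concrete: the core of the idea fits in a dozen lines of scikit-learn, the kind of demonstration such a course might begin with. A degree-1 polynomial underfits the data (high bias); a degree-15 polynomial memorizes noise (high variance); test error is lowest somewhere in between.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=80)  # noisy "true" function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))  # falls steadily
    test_err = mean_squared_error(y_test, model.predict(X_test))    # U-shaped
    print(f"degree {degree}: train {train_err:.3f}, test {test_err:.3f}")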

I might go a little further, and suggest that professors in every department are going to want to reflect on principles and applications of machine learning, so we can give students the background they need to keep thinking critically about our domain of expertise in a world where some (not all) aspects of reading and analysis may be automated.

Is this a lot of work? Yes. Are we already overburdened, and should we have more support? Yes and yes. Professors have already spent much of their lives mastering one field of expertise; asking them to pick up the basics of another field on the fly while doing their original jobs is a lot. So adapting to AI will happen slowly and it will be imperfect and we should all cut each other slack.

But to look on the bright side: none of this is boring. It’s not just a technical hassle to fend off. There are fundamental intellectual challenges here, and if we make it through these rapids in one piece, we’re likely to see some new things.

Categories
deep learning reproducibility and replication social effects of machine learning

We can save what matters about writing—at a price

It’s beginning to sink in that generative AI is going to force professors to change their writing assignments this fall. Corey Robin’s recent blog post is a model of candor on the topic. A few months ago, he expected it would be hard for students to answer his assignments using AI. (At least, it would require so much work that students would effectively have to learn everything he wanted to teach.) Then he asked his 15-year-old daughter to red-team his assignments. “[M]y daughter started refining her inputs, putting in more parameters and prompts. The essays got better, more specific, more pointed.”

Perhaps not every 15-year-old would get the same result. But still. Robin is planning to go with in-class exams “until a better option comes along.” It’s a good short-term solution.

In this post, I’d like to reflect on the “better options” we may need over the long term, if we want students to do more thinking than can fit into one exam period.

If you want an immediate pragmatic fix, there is good advice out there already about adjusting writing assignments. Institutions have not been asleep at the wheel. My own university has posted a practical guide, and the Modern Language Association and Conference on College Composition and Communication have (to their credit) quickly drafted a working paper on the topic that avoids panic and makes a number of wise suggestions. A recurring theme in many of these documents is “the value of process-focused instruction” (“Working Paper,” 10).

Why focus on process? A cynical way to think about it is that documenting the writing process makes it harder for students to cheat. There are lots of polished 5-page essays out there to imitate, but fewer templates that trace the evolution of an idea from an initial insight, through second thoughts, to dialectical final draft.

Making it harder to cheat is not a bad idea. But the MLA-CCCC task force doesn’t dwell on this cynical angle. Instead they suggest that we should foreground “process knowledge” and “metacognition” because those things were always the point of writing instruction. This is much the same thesis Corey Robin explores at the end of his post when he compares writing to psychotherapy: “Only on the couch have I been led to externalize myself, to throw my thoughts and feelings onto a screen and to look at them, to see them as something other, coldly and from a distance, the way I do when I write.”

Midjourney: “a hand writing with a quill reflected in a mirror, by MC Escher, in the style of meta-representation --ar 3:2 --weird 50”

Robin’s spin on this insight is elegiac: in losing take-home essays, we might lose an opportunity to teach self-critique. The task force spins it more optimistically, suggesting that we can find ways to preserve metacognition and even ways to use LLMs (large language models) to help students think about the writing process.

I prefer their optimistic spin. But of course, one can imagine an even-more-elegiac riposte to the task force report. “Won’t AI eventually find ways to simulate critical metacognition itself, writing the (fake) process reflection along with the final essay?”

Yes, that could happen. So this is where we reach the slightly edgier spin I feel we need to put on “teach the process” — which is that, over the long run, we can only save what matters about writing if we’re willing to learn something ourselves. It isn’t a good long-term strategy for us to approach these questions with the attitude that we (professors) have a fixed repository of wisdom — and the only thing AI should ever force us to discuss is, how to convey that wisdom effectively to students. If we take that approach, then yes, the game is over as soon as a model learns what we know. It will become possible to “cheat” by simulating learning.

But if the goal of education is actually to learn new things — and we’re learning those things along with our students — then simulating the process is not something to fear. Consider assignments that take the form of an experiment, for instance. Experiments can be faked. But you don’t get very far doing so, because fake experiments don’t replicate. If a simulated experiment does reliably replicate in the real world, we don’t call that “cheating” — but “in-silico research that taught us something new.”

If humanists and social scientists can find cognitive processes analogous to experiment — processes where a well-documented simulation of learning is the same thing as learning — we will be in the enviable position Robin originally thought he occupied: students who can simulate the process of doing an assignment will effectively have completed the assignment.

I don’t think most take-home essays actually occupy that safe position yet, because in reality our assignments often ask students to reinvent a wheel, or rehearse a debate that has already been worked through by some earlier generation. A number of valid (if perhaps conflicting) answers to our question are already on record. The verb “rehearse” may sound dismissive, but I don’t mean this dismissively. It can have real value to walk in the shoes of past generations. Sometimes ontogeny does need to recapitulate phylogeny, and we should keep asking students to do that, occasionally — even if they have to do it with pencil on paper.

But we will also need to devise new kinds of questions for advanced students—questions that are hard to answer even with AI assistance, because no one knows what the answer is yet. One approach is to ask students to gather and interpret fresh evidence by doing ethnography, interviewing people, digging into archival boxes, organizing corpora for text analysis, etc. These are assignments of a more demanding kind than we have typically handed undergrads, but that’s the point. Some things are actually easier now, and colleges may have to stretch students further in order to challenge them.

“Gathering fresh evidence” puts the emphasis on empirical data, and effectively preserves the take-home essay by turning it into an experiment. What about other parts of humanistic education: interpretive reflection, theory, critique, normative debate? I think all of those matter too. I can’t say yet how we’ll preserve them. It’s not the sort of problem one person could solve. But I am willing to venture that the meta-answer is, we’ll preserve these aspects of education by learning from the challenge and adapting these assignments so they can’t be fulfilled merely by rehearsing received ideas. Maybe, for instance, language models can help writers reflect explicitly on the wheels they’re reinventing, and recognize that their normative argument requires another twist before it will genuinely break new ground. If so, that’s not just a patch for writing assignments — but an advance for our whole intellectual project.

I understand that this is an annoying thesis. If you strip away the gentle framing, I’m saying that we professors will have to change the way we think in order to respond to generative AI. That’s a presumptuous thing to say about disciplines that have been around for hundreds of years, pursuing aims that remained relatively constant while new technologies came and went.

However, that annoying thesis is what I believe. Machine learning is not just another technology, and patching pedagogy is not going to be a sufficient response. (As Marc Watkins has recently noted, patching pedagogy with surveillance is a cure worse than the disease.) This time we can only save what matters about our disciplines if we’re willing to learn something in the process. The best I can do to make that claim less irritating is to add that I think we’re up for the challenge. I don’t feel like a voice crying in the wilderness on this. I see a lot of recent signs — from the admirable work of the MLA and CCCC to books like The Ends of Knowledge (eds. Scarborough King and Rudy) — that professors are thinking creatively about a wide range of recent challenges, and are capable of responding in ways that are at once critical and self-critical. Learning is our job. We’ve got this.

References

Center for Innovation in Teaching and Learning, UIUC. “Artificial Intelligence Implications in Teaching and Learning.” Champaign, IL, 2023.

MLA-CCCC Joint Task Force on Writing and AI, “MLA-CCCC Joint Task Force on Writing and AI Working Paper: Overview of the Issues, Statement of Principles, and Recommendations,” July 2023.

Robin, Corey. “How ChatGPT Changed My Plans for the Fall,” July 30, 2023.

Rudy, Seth, and Rachel Scarborough King, The Ends of Knowledge: Outcomes and Endpoints across the Arts and Sciences. London: Bloomsbury, 2023.

Watkins, Marc. “Will 2024 look like 1984?” July 31, 2023.

Categories
deep learning fiction

Using GPT-4 to measure the passage of time in fiction

Language models have been compared to parrots, but the bigger danger is that they turn people into parrots. A student who asks for “a paper about Middlemarch,” for instance, will get a pastiche loosely based on many things in the model’s training set. This may not count as plagiarism, but it won’t produce anything new.

But there are ways to use language models actively and creatively. We can select evidence to be analyzed, put it in a prompt, and specify the questions to be asked. Used this way, language models can create new knowledge that didn’t exist when they were trained. There are many ways to do this, and people may eventually get quite creative. But let’s start with a familiar task, so we can evaluate the results and decide whether language models really help. An obvious place to start is “content analysis”—a research method that analyzes hundreds or thousands of documents by posing questions about specific themes.

Below I work through a simple example of content analysis using the OpenAI API to measure the passage of time in fiction (see this GitHub repo for code). To spoil the suspense: I find that for this task, language models add something valuable, not just because they’re more accurate than older ways of doing this at scale but because they explain themselves better.

Why measure time in fiction?

Researchers already have several good ways to automate content analysis. Named entity extraction addresses certain kinds of questions. Topic modeling addresses others.

But there are also tricky questions that remain hard to answer by counting words. In 2017, for instance, I started to wonder how much time passes, on average, across a page of a novel. Literary-critical tradition suggested that there had been a pretty stable balance between “scene” (minute-by-minute description) and “summary” (which may cover weeks or years) until modernists started to leave out the summary and make every page breathlessly immediate [1]. But when I sat down with two graduate students (Sabrina Lee and Jessica Mercado) to manually characterize a thousand passages from fiction, we found instead a long trend. The average length of time represented in 250 words of fiction had been getting steadily shorter since the early eighteenth century. There was a trend toward immediacy, in other words, but modernism didn’t begin the trend [2].


The average length of time described in 250 words of narration. Y axis is logarithmic. The shaded ribbon represents a 95% confidence interval for the dashed curve, which is itself calculated by loess regression. From Underwood, “Why Literary Time is Measured in Minutes,” p. 352.

How well do bag-of-words methods estimate time?

We decided to characterize passages manually because references to time in fiction can’t always be taken literally. If a character thinks (or says) “wow, it’s been thirty years but feels like yesterday,” you don’t want to conclude that thirty years have passed on the page. So word-counting seemed risky. Even as human readers we often found it hard to decide how much time was passing. But when a passage was read by two different people, our estimates agreed with each other well enough to conclude that “fictive time” was a meaningful construct, if not a precise one (r = .74 on log-transformed durations).

A year after we published our paper, Greg Yauney showed that word-counting methods can do an acceptable job of estimating time [3]. He trained a bag-of-words model on the passages we had labeled, and applied it to a new set of passages labeled by new readers. The model-predicted durations correlated with human estimates at r = .35. While this is much lower than inter-human agreement, the model was stable enough to precisely measure a trend (across thousands of books) that matched the trend we had sketched using laborious human reading of a hundred books.

From Yauney, Underwood, and Mimno, “Computational Prediction of Elapsed Narrative Time.”

Replicating this on our old data, I get a slightly higher correlation between models and readers (r = .49 on log-transformed durations), perhaps because the data I’m using was produced by readers who compared notes for several weeks to maximize agreement. But since this is still lower than .74, bag-of-words models are definitely less good at estimating time than human readers.
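For readers curious what a model of that kind involves, here is a minimal sketch (not Yauney’s exact setup): regress log-durations on TF-IDF word frequencies, then score predictions with Pearson’s r.

import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def evaluate_bow(train_texts, train_minutes, test_texts, test_minutes):
    """Texts are 250-word passages; minutes are human duration estimates."""
    vectorizer = TfidfVectorizer(max_features=10000)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    model = Ridge().fit(X_train, np.log(train_minutes))
    predicted = model.predict(X_test)
    r, _ = pearsonr(predicted, np.log(test_minutes))
    return r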

Can we train LLMs to estimate time?

To get a large language model to answer a question, you first need to make sure it understands the question. As Simon Willison helpfully explains, the way to train a chat model through the API is to demonstrate the interaction you want by providing a series of imagined exchanges between a “user” and an “assistant” [4]. These aren’t real replies from the assistant but ego ideals you’re providing to teach it how to behave.

Here’s the instruction I had the “user” give:

Read the following passage of fiction. Then do five things.
 
1: Briefly summarize the passage.
2: Reason step by step to decide how much time is described in the passage. If the passage doesn't include any explicit reference to time, you can guess how much time the events described would have taken. Even description can imply the passage of time by describing the earlier history of people or buildings. But characters' references to the past or future in spoken dialogue should not count as time that passed in the scene. Report the time using units of years, weeks, days, hours, or minutes. Do not say zero or N/A.
3: If you described a range of possible times in step 2 take the midpoint of the range. Then multiply to convert the units into minutes.
4: Report only the number of minutes elapsed, which should match the number in step 3. Do not reply N/A.
5: Given the amount of speculation required in step 2, describe your certainty about the estimate--either high, moderate, or low.
The passage follows: <omitting this to save space>

You’ll notice that I’m using chain-of-thought prompting, building slowly from summary to inference to an estimated number of minutes. You may also notice I tell the model twice “don’t reply N/A.” I found the model was pretty cautious about speculating and needed to be encouraged to make assumptions and guesses. But to be honest, it’s possible that the model was right to be cautious! Duration is hard to estimate in some cases, and maybe we humans should have allowed N/A responses when we did this.

Here’s an example of an ego-ideal reply I provided to teach the “assistant” how to respond. This was responding to a passage from The Big Sleep.

1: A detective wakes up 'the next morning,' looks out a window for an undefined time, drinks (and presumably needs to make) two cups of coffee, then shaves and showers and gets dressed before stepping out his front door and seeing a car.
2: Making coffee, showering, and getting dressed take at least an hour. There's some ambiguity about whether to count the implicit reference to yesterday (since this is 'the next morning') as time elapsed in the passage, but let's say no, since yesterday is not actually described. So, an hour to 90 minutes.
3: 1.25 hours have elapsed. Multiplying by 60 minutes an hour that's 75 minutes.
4: 75 minutes.
5: Low confidence, because of ambiguity about a reference to the previous day.

Using some code generously shared by Quinn Dombrowski, I gave the model four query-reply sequences like this, then asked it to characterize a new passage with the same instructions.

To assess performance on this simple task, I just extracted the minutes reported in step 4 and compared them to our human estimates from 2017. Other forms of content analysis might require the model to edit or mark up the text provided in the prompt. That’s doable, but I thought I would start with something simple.
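Putting the pieces together, the pipeline is short. Below is a sketch using the pre-1.0 openai library; INSTRUCTIONS stands for the five-step prompt quoted above, and the function names are mine rather than the exact code in the repo.

import re
import openai

def build_messages(demonstrations, new_passage, instructions):
    """demonstrations: (passage, ideal_reply) pairs like the one above."""
    messages = []
    for passage, ideal_reply in demonstrations:
        messages.append({"role": "user", "content": instructions + passage})
        messages.append({"role": "assistant", "content": ideal_reply})
    messages.append({"role": "user", "content": instructions + new_passage})
    return messages

def estimate_minutes(demonstrations, new_passage, instructions):
    """Run the five-step prompt and pull the number out of step 4."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=build_messages(demonstrations, new_passage, instructions),
    )
    reply = response.choices[0].message.content
    match = re.search(r"4:\s*([\d,.]+)\s*minutes", reply)
    if match is None:
        return None  # the model refused or answered N/A
    return float(match.group(1).rstrip(".").replace(",", ""))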

I added step 5 (allowing the model to describe its own confidence) because in early experiments I found the model’s tendency to editorialize extremely valuable. The answers to step 5 were also fun to read as replies scrolled up the page, because my new sorcerer’s assistant complained volubly about the ambiguity of its task. (“30 minutes. Low confidence, as the passage is more focused on the poetic and symbolic aspects of the scene rather than providing a clear sense of time.”). In some cases it refused the task altogether and threw the question about confidence back in my face. (“N/A. High confidence that no specific amount of time is described.”) This capacity for backtalk is a feature not a bug: I learned a lot from it.

But after I adjusted my prompt to address the ambiguities the assistant correctly pointed out, complete refusal to answer the question was rare.

How well does GPT-4 estimate time?

I had the Turbo model code 483 passages. Its predictions correlated with human estimates at r = .59. I only asked GPT-4 to code 121 passages (because it’s more expensive to run than Turbo), and it achieved r = .68. This is not as good as inter-human agreement (.74), but it’s closer to human readers than to bag-of-words models (.35–.49). And of course GPT-4 does the work more quickly than human readers. It took the three of us several months to generate this data, but my LLM experiment was run in a couple of days. Plus, given an API, large language models are easier to use than other forms of machine learning: the main challenge is to describe your question systematically. This is a method that could realistically be used by researchers with relatively little programming experience.

The total cost to my OpenAI account was $22. Of that amount, about $8 was testing the prompts and running on the (cheaper) Turbo API. The final run on GPT-4 cost $14; if I had completely replicated our work from 2017 on GPT-4 it would have cost $140. So a project like this will not bankrupt researchers. But reliance on OpenAI is still a bad idea in the long run. Universities should develop their own models and APIs, both for reasons of institutional autonomy/survival (!!😬!!) and so that researchers can inspect and tune the models.

However, one could exaggerate the problems of interpretability created by a closed model. Watching the model think out loud about each passage actually gave me a better sense of how results were produced than I could extract from my own records of human research in 2017. (See below for an example from GPT-4: human readers don’t usually document their thought process this verbosely.) Also, because human beings can’t erase their own memories, it’s not possible for us to “repeat the same question with slightly different language.” If I try that on myself, I’m going to remember what I thought before and probably think it again. But a model can wake up each moment with eternal sunshine of the spotless mind.

ABRIDGED PASSAGE FROM Jane Eyre: And signs, for aught we know, may be but the sympathies of Nature with man. When I was a little girl, only six years old, I one night heard Bessie Leaven say to Martha Abbot that she had been dreaming about a little child ... Of late I had often recalled this saying and this incident; for during the past week scarcely a night had gone over my couch that had not brought with it a dream of an infant, which I sometimes hushed in my arms, sometimes dandled on my knee, sometimes watched playing with daisies on a lawn, or again, dabbling its hands in running water. It was a wailing child this night, and a laughing one the next: now it nestled close to me, and now it ran from me; but whatever mood the apparition evinced, whatever aspect it wore, it failed not for seven successive nights to meet me the moment I entered the land of slumber.

ASSISTANT REPLIES. 1: The narrator recalls a saying about dreaming of children being a sign of trouble and remembers an incident from her childhood. She then describes her recent dreams of infants over the past week. 
2: The passage mentions the narrator's childhood memory, which happened when she was six years old, and the past week of dreams. The time elapsed in the passage is the week of dreams. 
3: 1 week has elapsed. At seven days a week that's 7 days. 24 hours in a day, so 168 hours. 60 minutes in an hour, so 10080 minutes. 
4: 10080 minutes. 
5: Moderate confidence, as the childhood memory is not counted as time passing in the scene.

What no model can do

While I think large language models will be tremendously useful assistants in content analysis, I don’t think we can dispense with multiple human readers. As David Bamman explains in a good Twitter thread, the core challenges of this work are “a. coming up with a new construct to measure, b. demonstrating that it can be measured, and c. showing that there’s value in doing so.” Intersubjective agreement between human beings is still the only way to know that we have addressed those questions.

That’s why I wouldn’t feel comfortable just making up my own definition of, say, “suspense,” teaching a model to measure it, and then running the model across a thousand books. The problem with that approach is not that LLMs have measurement error. (As we’ve seen in Greg Yauney’s paper, measurement methods with much more error can be productive.) The problem is that it isn’t clear what a single researcher’s construct means in the first place. To be confident that we’re measuring something called “suspense” we need to show that multiple people recognize it as suspense. And in spite of all the rhetorical personification in this blog post (which I trust you have understood rhetorically), a model is not a separate person. So a project of this kind still needs to start by getting several human beings to read systematically and compare notes, before the human codebook is translated into a prompt.

On the other hand, I do think language models will allow us to pose questions that are currently hard to pose at scale. Questions about plot and character, for instance, require reading a whole book while applying delicate decision criteria that may be hard to remember and keep stable across many books. Language models will help here not just because they’re fast, but because they can provide a relatively stable yardstick by which to measure slippery concepts, even at a modest scale of analysis.

This is a preliminary report on a very strange world. For me, the most surprising take-away from this experiment was not that deep learning is more accurate than statistical NLP, but that it may also be in some ways more interpretable. Because a language model has to think out loud, it tends to automatically document its own reasoning. This is useful even when (or especially when) the model doesn’t think you’ve defined the construct it’s supposed to measure clearly enough.

References

[1] Gérard Genette, Narrative Discourse: An Essay in Method, trans. Jane E. Lewin (Ithaca: Cornell Univ. Press, 1980), 97.

[2] Ted Underwood, “Why Literary Time is Measured in Minutes,” New Literary History 85.2 (2018) 351-65.

[3] Greg Yauney, Ted Underwood, and David Mimno, “Computational Prediction of Elapsed Narrative Time” (2019).

[4] Simon Willison, “A Simple Python Wrapper for the ChatGPT API” (2023).

Categories
interpretive theory machine learning social effects of machine learning transformer models

Mapping the latent spaces of culture

[On Tues Oct 26, the Center for Digital Humanities at Princeton will sponsor a roundtable on the implications of “Stochastic Parrots” for the humanities. To prepare for that roundtable, they asked three humanists to write position papers on the topic. Mine follows. I’ll give a 5-min 500-word précis at the event itself; this is the 2000-word version, with pictures. It also has a DOI if you want a stable version to cite.]

The technology at the center of this roundtable doesn’t yet have a consensus name. Some observers point to an architecture, the Transformer.[1] “On the Dangers of Stochastic Parrots” focuses on size and discusses “large language models.”[2] A paper from Stanford emphasizes applications: “foundation models” are those that can adapt “to a wide range of downstream tasks.”[3] Each definition identifies a different feature of recent research as the one that matters. To keep that question open, I’ll refer here to “deep neural models of language,” a looser category.

However we define them, neural models of language are already changing the way we search the web, write code, and even play games. Academics outside computer science urgently need to discuss their role. “On the Dangers of Stochastic Parrots” deserves credit for starting the discussion—especially since publication required tenacity and courage. I am honored to be part of an event exploring its significance for the humanities.

The argument that Bender et al. advance has two parts: first, that large language models pose social risks, and second, that they will turn out to be “misdirected research effort” anyway, since they pretend to perform “natural language understanding” but “do not have access to meaning” (615).

I agree that the trajectory of recent research is dangerous. But to understand the risks language models pose, I think we will need to understand how they produce meaning. The premise that they simply “do not have access to meaning” tends to prevent us from grasping the models’ social role. I hope humanists can help here by offering a wider range of ways to think about the work language does.

It is true that language models don’t yet represent their own purposes or an interlocutor’s state of mind. These are important aspects of language, and for “Stochastic Parrots,” they are the whole story: the article defines meaning as “meaning conveyed between individuals” and “grounded in communicative intent” (616). 

But in historical disciplines, it is far from obvious that all meaning boils down to intentional communication between individuals. Historians often use meaning to describe something more collective, because the meaning of a literary work, for example, is not circumscribed by intent. It is common for debates about the meaning of a text to depend more on connections to books published a century earlier (or later) than on reconstructing the author’s conscious plan.[4]

I understand why researchers in a field named “artificial intelligence” would associate meaning with mental activity and see writing as a dubious proxy for it. But historical disciplines rarely have access to minds, or even living subjects. We work mostly with texts and other traces. For this reason, I’m not troubled by the part of “Stochastic Parrots” that warns about “the human tendency to attribute meaning to text” even when the text “is not grounded in communicative intent” (618, 616). Historians are already in the habit of finding meaning in genres, nursery rhymes, folktale motifs, ruins, political trends, and other patterns that never had a single author with a clear purpose.[5] If we could only find meaning in intentional communication, we wouldn’t find much meaning in the past at all. So not all historical researchers will be scandalized when we hear that a model is merely “stitching together sequences of linguistic forms it has observed in its vast training data” (617). That’s often what we do too, and we could use help.

A willingness to find meaning in collective patterns may be especially necessary for disciplines that study the past. But this flexibility is not limited to scholars. The writers and artists who borrow language models for creative work likewise appreciate that their instructions to the model acquire meaning from a training corpus. The phrase “Unreal Engine,” for instance, encourages CLIP to select pictures with a consistent, cartoonified style. But this has nothing to do with the dictionary definition of “unreal.” It’s just a helpful side-effect of the fact that many pictures are captioned with the name of the game engine that produced them.

In short, I think people who use neural models of language typically use them for a different purpose than “Stochastic Parrots” assumes. The immediate value of these models is often not to mimic individual language understanding, but to represent specific cultural practices (like styles or expository templates) so they can be studied and creatively remixed. This may be disappointing for disciplines that aspire to model general intelligence. But for historians and artists, cultural specificity is not disappointing. Intelligence only starts to interest us after it mixes with time to become a biased, limited pattern of collective life. Models of culture are exactly what we need.

While I’m skeptical that language models are devoid of meaning, I do share other concerns in “Stochastic Parrots.” For instance, I agree that researchers will need a way to understand the subset of texts that shape a model’s response to a given prompt. Culture is historically specific, so models will never be free of omission and bias. But by the same token, we need to know which practices they represent. 

If companies want to offer language models as a service to the public—say, in web search—they will need to do even more than know what the models represent. Somehow, a single model will need to produce a picture of the world that is acceptable to a wide range of audiences, without amplifying harmful biases or filtering out minority discourses (Bender et al., 614). That’s a delicate balancing act.  

Historians don’t have to compress their material as severely. Since history is notoriously a story of conflict, and our sources were interested participants, few people expect historians to represent all aspects of the past with one correctly balanced model. On the contrary, historical inquiry is usually about comparing perspectives. Machine learning is not the only way to do this, but it can help. For instance, researchers can measure differences of perspective by training multiple models on different publication venues or slices of the timeline.[6]

When research is organized by this sort of comparative purpose, the biases in data are not usually a reason to refrain from modeling—but a reason to create more corpora and train models that reflect a wider range of biases. On the other hand, training a variety of models becomes challenging when each job requires thousands of GPUs. Tech companies might have the resources to train many models at that scale. But will universities? 

There are several ways around this impasse. One is to develop lighter-weight models.[7] Another is to train a single model that can explicitly distinguish multiple perspectives. At present, researchers create this flexibility in a rough and ready way by “fine-tuning” BERT on different samples. A more principled approach might design models to recognize the social structure in their original training data. One recent paper associates each text with a date stamp, for instance, to train models that respond differently to questions about different years.[8] Similar approaches might produce models explicitly conditioned on variables like venue or nationality—models that could associate each statement or prediction they make with a social vantage point.

If neural language models are to play a constructive role in research, universities will also need alternatives to material dependence on tech giants. In 2020, it seemed that only the largest corporations could deploy enough silicon to move this field forward. In October 2021, things are starting to look less dire. Coalitions like EleutherAI are reverse-engineering language models.[9] Smaller corporations like HuggingFace are helping to cover underrepresented languages. NSF is proposing new computing resources.[10] The danger of oligopoly is by no means behind us, but we can at least begin to see how scholars might train models that represent a wider range of perspectives.

Of course, scholars are not the only people who matter. What about the broader risks of language modeling outside universities?

I agree with the authors of “Stochastic Parrots” that neural language models are dangerous. But I am not sure that critical discourse has alerted us to the most important dangers yet. Critics often prefer to say that these models are dangerous only because they don’t work and are devoid of meaning. That may seem to be the strongest rhetorical position (since it concedes nothing to the models), but I suspect this hard line also prevents critics from envisioning what the models might be good for and how they’re likely to be (mis)used.

Consider the surprising art scene that sprang up when CLIP was released. OpenAI still hasn’t released the DALL-E model that translates CLIP’s embeddings of text into images.[11] But that didn’t stop graduate students and interested amateurs from duct-taping CLIP to various generative image models and using the contraption to explore visual culture in dizzying ways. 

“The angel of air. Unreal Engine,” VQGAN + CLIP, Aran Komatsukaki, May 31, 2021.

Will the emergence of this subculture make any sense if we assume that CLIP is just a failed attempt to reproduce individual language use? In practice, the people tinkering with CLIP don’t expect it to respond like a human reader. More to the point, they don’t want it to. They’re fascinated because CLIP uses language differently than a human individual would—mashing together the senses and overtones of words and refracting them into the potential space of internet images like a new kind of synesthesia.[12] The pictures produced are fascinating, but (at least for now) too glitchy to impress most people as art. They’re better understood as postcards from an unmapped latent space.[13] The point of a postcard, after all, is not to be itself impressive, but to evoke features of a larger region that looks fun to explore. Here the “region” is a particular visual culture; artists use CLIP to find combinations of themes and styles that could have occurred within it (although they never quite did).

“The clockwork angel of air, trending on ArtStation,” Diffusion + CLIP, @rivershavewings (Katherine Crowson), September 14, 2021.

Will models of this kind also have negative effects? Absolutely. The common observation that “they could reinforce existing biases” is the mildest possible example. If we approach neural models as machines for mapping and rewiring collective behavior, we will quickly see that they could do much worse: for instance, deepfakes could create new hermetically sealed subcultures and beliefs that are impossible to contest. 

I’m not trying to decide whether neural language models are good or bad in this essay—just trying to clarify what’s being modeled, why people care, and what kinds of (good or bad) effects we might expect. Reaching a comprehensive judgment is likely to take decades. After all, models are easy to distribute. So this was never a problem, like gene splicing, that could stay bottled up as an ethical dilemma for one profession that controlled the tools. Neural models more closely resemble movable type: they will change the way culture is transmitted in many social contexts. Since the consequences of movable type included centuries of religious war in Europe, the analogy is not meant to reassure. I just mean that questions on this scale don’t get resolved quickly or by experts. We are headed for a broadly political debate about antitrust, renewable energy, and the shape of human culture itself—a debate where everyone will have some claim to expertise.[14]

Let me end, however, on a positive note. I have suggested that approaching neural models as models of culture rather than intelligence or individual language use gives us even more reason to worry. But it also gives us more reason to hope. It is not entirely clear what we plan to gain by modeling intelligence, since we already have more than seven billion intelligences on the planet. By contrast, it’s easy to see how exploring spaces of possibility implied by the human past could support a more reflective and more adventurous approach to our future. I can imagine a world where generative models of culture are used grotesquely or locked down as IP for Netflix. But I can also imagine a world where fan communities use them to remix plot tropes and gender norms, making “mass culture” a more self-conscious, various, and participatory phenomenon than the twentieth century usually allowed it to become. 

I don’t know which of those worlds we will build. But either way, I suspect we will need to reframe our conversation about artificial intelligence as a conversation about models of culture and the latent spaces they imply. Philosophers and science fiction writers may enjoy debating whether software can have mental attributes like intention. But that old argument does little to illuminate the social questions new technologies are really raising. Neural language models are dangerous and fascinating because they can illuminate and transform shared patterns of behavior—in other words, aspects of culture. When the problem is redescribed this way, the concerns about equity foregrounded by “Stochastic Parrots” still matter deeply. But the imagined contrast between mimicry and meaning in the article’s title no longer connects with any satirical target. Culture clearly has meaning. But I’m not sure that anyone cares whether a culture has autonomous intent, or whether it is merely parroting human action.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention is All You Need,” 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, 2017. https://arxiv.org/abs/1706.03762 
[2] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Margaret Mitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, March 2021, 610–623. https://doi.org/10.1145/3442188.3445922
[3] Rishi Bommasani et al., “On the Opportunities and Risks of Foundation Models,” CoRR Aug 2021, https://arxiv.org/abs/2108.07258, 3-4.
[4] “[I]t is language which speaks, not the author.” Roland Barthes, “The Death of the Author,” Image / Music / Text, trans. Stephen Heath (New York: Hill and Wang, 1977), 143. 
[5] To this list one might also add the material and social aspects of book production. In commenting on “Stochastic Parrots,” Katherine Bode notes that book history prefers to paint a picture where “meaning is dispersed across…human and non-human agents.” Katherine Bode, qtd. in Lauren M. E. Goodlad, “Data-centrism and its Discontents,” Critical AI, Oct 15, 2021, https://criticalai.org/2021/10/14/blog-recap-stochastic-parrots-ethics-of-data-curation/
[6] Sandeep Soni, Lauren F. Klein, and Jacob Eisenstein, “Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers,” Journal of Cultural Analytics, January 18, 2021, https://culturalanalytics.org/article/18841-abolitionist-networks-modeling-language-change-in-nineteenth-century-activist-newspapers. Ted Underwood, “Machine Learning and Human Perspective,” PMLA 135.1 (Jan 2020): 92-109, http://hdl.handle.net/2142/109140.
[7] I’m writing about “neural language models” rather than “large” ones because I don’t assume that ever-increasing size is a definitional feature of this technology. Strategies to improve efficiency are discussed in Bommasani et al., 97-100.
[8] Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein and William W. Cohen, “Time-Aware Language Models as Temporal Knowledge Bases,” CoRR 2021, https://arxiv.org/abs/2106.15110.
[9] See for instance, Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman, “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow,” March 2021, https://doi.org/10.5281/zenodo.5297715.
<a href="#_ftnref9"[10] White House Briefing Room, “The White House Announces the National Artificial Intelligence Research Resource Task Force,” June 10, 2021, https://www.whitehouse.gov/ostp/news-updates/2021/06/10/the-biden-administration-launches-the-national-artificial-intelligence-research-resource-task-force/
[11] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, “Zero-Shot Text-to-Image Generation,” February 2021, https://arxiv.org/abs/2102.12092.
[12] One good history of this scene is titled “Alien Dreams”—a title that concisely indicates how little interest artists have in using CLIP to reproduce human behavior. Charlie Snell, “Alien Dreams: An Emerging Art Scene,” June 30, 2021, https://ml.berkeley.edu/blog/posts/clip-art/.
[13] For a skeptical history of this spatial metaphor, see Nick Seaver, “Everything Lies in a Space: Cultural Data and Spatial Reality,” Journal of the Royal Anthropological Institute 27 (2021). https://doi.org/10.1111/1467-9655.13479. We also skeptically probe the limits of spatial metaphors for culture (but end up confirming their value) in Ted Underwood and Richard Jean So, “Can We Map Culture?” Journal of Cultural Analytics, June 17, 2021, https://doi.org/10.22148/001c.24911.
[14] I haven’t said much about the energy cost of training models. For one thing, I’m not fully informed about contemporary efforts to keep that cost low. More importantly, I think the cause of carbon reduction is actively harmed by pitting different end users against each other. If we weigh the “carbon footprint” of your research agenda against my conference travel, the winner will almost certainly be British Petroleum. Renewable energy is a wiser thing to argue about if carbon reduction is actually our goal. Mark Kaufman, “The carbon footprint sham: a ‘successful, deceptive’ PR campaign,” Mashable, July 9, 2021, https://mashable.com/feature/carbon-footprint-pr-campaign-sham.

Categories
machine learning transformer models

Science fiction hasn’t prepared us to imagine machine learning.

Science fiction did a great job preparing us for submarines and rockets. But it seems to be struggling lately. We don’t know what to hope for, what to fear, or what genre we’re even in.

Space opera? Seems unlikely. And now that we’ve made it to 2021, the threat of zombie apocalypse is receding a bit. So it’s probably some kind of cyberpunk. But there are many kinds of cyberpunk. Should we get ready to fight AI or to rescue replicants from a sinister corporation? It hasn’t been obvious. I’m writing this, however, because recent twists in the plot seem to clear up certain mysteries, and I think it’s now possible to guess which subgenre the 2020s are steering toward.

Clearly some plot twist involving machine learning is underway. It’s been hard to keep up with new developments: from BERT (2018) to GPT-3 (2020)—which can turn a prompt into an imaginary news story—to, most recently, CLIP and DALL-E (2021), which can translate verbal descriptions into images.

Output from DALL-E. If you prefer, you can have a baby daikon radish in a tutu walking a dog.

I have limited access to DALL-E, and can’t test it in any detail. But if we trust the images released by OpenAI, the model is good at fusing and extrapolating abstractions: it not only knows what it means for a lemur to hold an umbrella, but can produce a surprisingly plausible “photo of a television from the 1910s.” All of this is impressive for a research direction that isn’t much more than four years old.

The prompt here is “a photo of a television from the …<fill in the decade>”

On the other hand, some AI researchers don’t believe these models are taking the field in the direction it was supposed to go. Gary Marcus and Ernest Davis, for instance, doubt that GPT-3 is “an important step toward artificial general intelligence—the kind that would … reason broadly in a manner similar to humans … [GPT-3] learns correlations between words, and nothing more.”

People who want to contest that claim can certainly find evidence on the other side of the question. I’m not interested in pursuing the argument here. I just want to know why recent advances in deep learning give me a shivery sense that I’ve crossed over into an unfamiliar genre. So let’s approach the question from the other side: what if these models are significant because they don’t reason “in a manner similar to humans”?

It is true, after all, that models like DALL-E and GPT-3 are only learning (complex, general) patterns of association between symbols. When GPT-3 generates a sentence, it is not expressing an intention or an opinion—just making an inference about the probability of one sentence in a vast “latent space” of possible sentences implied by its training data.
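To make “an inference about the probability of one sentence” concrete, here is a toy sketch that scores sentences with GPT-2 (a smaller, public cousin of GPT-3) via the HuggingFace transformers library. The model choice and the example sentences are purely illustrative.

```python
# A toy sketch: summed log-probability of a sentence under GPT-2.
# Assumes the transformers and torch libraries are installed; GPT-2
# stands in here for larger models like GPT-3.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_probability(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model returns the mean negative
        # log-likelihood per predicted token.
        loss = model(ids, labels=ids).loss
    # Multiply back out to get the total log-probability.
    return -loss.item() * (ids.shape[1] - 1)

# Word salad should score far lower than ordinary English.
print(sentence_log_probability("The cat sat on the mat."))
print(sentence_log_probability("Mat the on sat cat the."))
```

A low score doesn’t mean a sentence is wrong, just that it occupies a thinly populated corner of the latent space.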

When I say “a vast latent space,” I mean really vast. This space includes, for instance, the thoughts Jerome K. Jerome might have expressed about Twitter if he had lived in our century.

Mario Klingemann gets GPT-3 to extrapolate from a title and a byline.

But a latent space, however vast, is still quite different from goal-driven problem solving. In a sense the chimpanzee below is doing something more like human reasoning than a language model can.

Primates, understandably, envision models of the world as things individuals create in order to reach bananas. (Ultimately from Wolfgang Köhler, The Mentality of Apes, 1925.)

Like us, the chimpanzee has desires and goals, and can make plans to achieve them. A language model does none of that by itself—which is probably why language models are impressive at the paragraph scale but tend to wander if you let them run for pages.

So where does that leave us? We could shrug off the buzz about deep learning, say “it’s not even as smart as a chimpanzee yet,” and relax because we’re presumably still living in a realist novel.

And yes, to be sure, deep learning is in its infancy and will be improved by modeling larger-scale patterns. On the other hand, it would be foolish to ignore early clues about what it’s good for. There is something bizarrely parochial about a view of mental life that makes predicting a nineteenth-century writer’s thoughts about Twitter less interesting than stacking boxes to reach bananas. Perhaps it’s a mistake to assume that advances in machine learning are only interesting when they resemble our own (supposedly “general”) intelligence. What if intelligence itself is overrated?

The collective symbolic system we call “culture,” for instance, coordinates human endeavors without being itself intelligent. What if models of the world (including models of language and culture) are important in their own right—and needn’t be understood as attempts to reproduce the problem-solving behavior of individual primates? After all, people are already very good at having desires and making plans. We don’t especially need a system that will do those things for us. But we’re not great at imagining the latent space of (say) all protein structures that can be created by folding amino acids. We could use a collaborator there.

Storytelling seems to be another place where human beings sense a vast space of latent possibility, and tend to welcome collaborators with maps. Look at what’s happening to interactive fiction on sites like AI Dungeon. Tens of thousands of users are already making up stories interactively with GPT-3. There’s a subreddit devoted to the phenomenon. Competitors are starting to enter the field. One startup, Hidden Door, is trying to use machine learning to create a safe social storytelling space for children. For a summary of what collaborative play can build, we could do worse than their motto: “Worlds with Friends.”

It’s not hard to see how the “social play” model proposed by Hidden Door could eventually support the form of storytelling that grown-ups call fan fiction. Characters or settings developed by one author might be borrowed by others. Add something like DALL-E, and writers could produce illustrations for their story in a variety of styles—from Arthur Rackham to graphic novel.

Will a language model ever be as good as a human author? Can it ever be genuinely original? I don’t know, and I suspect those are the wrong questions. Storytelling has never been a solitary activity undertaken by geniuses who invent everything from scratch. From its origin in folk tales, fiction has been a game that works by rearranging familiar moves, and riffing on established expectations. Machine learning is only going to make the process more interactive, by increasing the number of people (and other agents) involved in creating and exploring fictional worlds. The point will not be to replace human authors, but to make the universe of stories bigger and more interconnected.

Storytelling and protein folding are two early examples of domains where models will matter not because they’re “intelligent,” but because they allow us—their creators—to collaboratively explore a latent space of possibility. But I will be surprised if these are the only two places where that pattern emerges. Music and art, and other kinds of science, are probably open to the same kind of exploration.

This collaborative future could be weirder than either science fiction or journalism has taught us to expect. News stories about ML invariably invite readers to imagine autonomous agents analogous to robots: either helpful servants or inscrutable antagonists like the Terminator and HAL. Boring paternal condescension or boring dread are the only reactions that seem possible within this script.

We need to be considering a wider range of emotions. Maybe a few decades from now, autonomous AI will be a reality and we’ll have to worry whether it’s servile or inscrutable. Maybe? But that’s not the genre we’re in at the moment. Machine learning is already transforming our world, but the things that should excite and terrify us about the next decade are not even loosely analogous to robots. We should be thinking instead about J. L. Borges’ Library of Babel—a vast labyrinth containing an infinite number of books no eye has ever read. There are whole alternate worlds on those shelves, but the Library is not a robot, an alien, or a god. It is just an extrapolation of human culture.

Eric Desmazieres, “The Library of Babel.”

Machine learning is going to be, let’s say, a thread leading us through this Library—or perhaps a door that can take us to any bookshelf we imagine. So if the 2020s are a subgenre of SF, I would personally predict a mashup of cyberpunk and portal fantasy. With sinister corporations, of course. But also more wardrobes, hidden doors, encyclopedias of Tlön, etc., than we’ve been led to expect in futuristic fiction.

I’m not saying this will be a good thing! Human culture itself is not always a good thing, and extrapolating it can take you places you don’t want to go. For instance, movements like QAnon make clear that human beings are only too eager to invent parallel worlds. Armored with endlessly creative deepfakes, those worlds might become almost impenetrable. So we’re probably right to fear the next decade. But let’s point our fears in a useful direction, because we have more interesting things to worry about than a servant who refuses to “open the pod bay doors.” We are about to be in a Borges story, or maybe, optimistically, the sort of portal fantasy where heroines create doors with a piece of chalk and a few well-chosen words. I have no idea how our version of that story ends, but I would put a lot of money on “not boring.”

Categories
machine learning problems of scale

A more intimate scale of distant reading.

How big, exactly, does a collection of literary texts have to be before it makes sense to say we’re doing “distant reading”?

It’s a question people often ask, and a question that distant readers often wriggle out of answering, for good reason. The answer is not determined by the technical limits of any algorithm. It depends, rather, on the size of the blind spots in our knowledge of the literary past — and it’s part of the definition of a blind spot that we don’t already know how big it is. How far do you have to back up before you start seeing patterns that were invisible at your ordinary scale of reading? That’s how big your collection needs to be.

But from watching trends over the last couple of years, I am beginning to get the sense that the threshold for distant reading is turning out to be a bit lower than many people are currently assuming (and lower than I assumed myself in the past). To cut to the chase: it’s probably dozens or scores of books, rather than thousands.

I think there are several reasons why we all got a different impression. One is that Franco Moretti originally advertised distant reading as a continuation of 1990s canon-expansion: the whole point, presumably, was to get beyond the canon and recover a vast “slaughterhouse of literature.” That’s still one part of the project — and it leads to a lot of debate about the difficulty of recovering that slaughterhouse. But sixteen years later, it is becoming clear that new methods also allow us to do a whole lot of things that weren’t envisioned in Moretti’s original manifesto. Even if we restricted our collections to explicitly canonical works, we would still be able to tease out trends that are too long, or family resemblances that are too loose, to be described well in existing histories.

The size of the collection required depends on the question you’re posing. Unsupervised algorithms, like those used for topic modeling, are easy to package as tools: just pour in the books, and out come some topics. But since they’re not designed to answer specific questions, these approaches tend to be most useful for exploratory problems, at large scales of inquiry. (One recent project by Emily Barry, for instance, uses 22,000 Supreme Court cases.)
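To make the “just pour in the books” workflow concrete, here is a toy sketch using scikit-learn’s LatentDirichletAllocation. The four-line “corpus” and the choice of two topics are obviously placeholders for real books and real parameter decisions.

```python
# A toy topic model: pour in documents, get back topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny stand-ins for a corpus of books.
documents = [
    "whale ship sea captain harpoon voyage storm",
    "ballroom marriage estate letter courtship propriety",
    "sea voyage island storm sailor captain",
    "marriage inheritance estate letter drawing room",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)  # documents as topic mixtures

# Show the most heavily weighted words in each topic.
vocab = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[-5:][::-1]
    print("topic", i, ":", " ".join(vocab[j] for j in top))
```

With real collections, of course, the interesting work begins after this step, in deciding what (if anything) the topics mean.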

By contrast, a lot of recent work in distant reading has used supervised models to zero in on narrowly specified historical questions about genre or form. This approach can tell you things you didn’t already know at a smaller scale of inquiry. In “Literary Pattern Recognition,” Hoyt Long and Richard So start by gathering 400 poems in the haiku tradition. In a recent essay on genre I talk about several hundred works of detective fiction, but also ten hardboiled detective novels, and seven Newgate novels.

 

Predictive accuracy for several genres of roughly generational size, plotted relative to a curve that indicates accuracy for a random sample of detective fiction drawn from the whole period 1829-1989. The shaded ribbon covers 90% of models for a given number of examples.

Admittedly, seven is on the low side. I wouldn’t put a lot of faith in any individual dot above. But I do think we can learn something by looking at five subgenres that each contain 7-21 volumes. (In the graph above we learn, for instance, that focused “generational” genres aren’t lexically more coherent than a sample drawn from the whole 160 years of detective fiction — because the longer tradition is remarkably coherent, and pretty easy to recognize, even when you downsample it to ten or twenty volumes.)

I’d like to pitch this reduction of scale as encouraging news. Grad students and assistant professors don’t have to build million-volume collections before they can start exploring new methods. And literary scholars can practice distant reading without feeling they need to buy into any cyclopean ethic of “big data.” (I’m not sure that ethic exists, except as a vaguely-sketched straw man. But if it did exist, you wouldn’t need to buy into it.)

Computational methods themselves won’t even be necessary for all of this work. For some questions, standard social-scientific content analysis (aka reading texts and characterizing them according to an agreed-upon scheme) is a better way to proceed. In fact, if you look back at “The Slaughterhouse of Literature,” that’s what Moretti did with “about twenty” detective stories (212). Shawna Ross recently did something similar, looking at the representation of women’s scholarship at MLA 2016 by reading and characterizing 792 tweets.

Humanists still have a lot to learn about social-scientific methods, as Tanya Clement has recently pointed out. (Inter-rater reliability, anyone?) And I think content analysis will run into some limits as we stretch the timelines of our studies: as you try to cover centuries of social change, it gets hard to frame a predefined coding scheme that’s appropriate for everything on the timeline. Computational models have some advantages at that scale, because they can be relatively flexible. Plus, we actually do want to reach beyond the canon.

But my point is simply that “distant reading” doesn’t prescribe a single scale of analysis. There’s a smooth ramp that leads from describing seven books, to characterizing a score or so (still by hand, but in a more systematic way), to statistical reflection on the uncertainty and variation in your evidence, to text mining and computational modeling (which might cover seven books or seven hundred). Proceed only as far as you find useful for a given question.

Categories
disciplinary history machine learning methodology

Emerging conversations between literary history and sociology.

As Jim English remarked in 2010, literary scholars have tended to use sociology “for its conclusions rather than its methods.” We might borrow a term like “habitus” from Bourdieu, but we weren’t interested in borrowing correspondence analysis. If we wanted to talk about methodology with social scientists at all, we were more likely to go to the linguists. (A connection to linguistics in fact almost defined “humanities computing.”)

But a different conversation seems to have emerged recently. A special issue of Poetics on topic models in 2013 was one early sign of methodological conversation between sociology and literary study. This year, Ben Merriman’s sociological review of books by Moretti and Jockers was followed by comments from Andrew Goldstone and Tressie McMillan Cottom, and then by a special issue of Cultural Sociology and by Goldstone’s response to Gisèle Sapiro. Most recently a special issue of Big Data and Society (table of contents), organized by sociologists, included several articles on literary history and/or literary theory.

What’s going on here?

Conveniently, several articles in Big Data and Society are trying to explain the reasons for growing methodological overlap between these disciplines. I think it’s interesting that the sociologists and literary scholars involved are telling largely the same story (though viewing it, perhaps, from opposite sides of a mirror).

First, the perspective of social scientists. In “Toward a computational hermeneutics,” John W. Mohr, Robin Wagner-Pacifici, and Ronald L. Breiger (who collectively edited this special issue of BDS) suggest that computational methods are facilitating a convergence between the social-scientific tradition of “content analysis” and kinds of close reading that have typically been more central to the humanities.

Close reading? Well, yes, relative to what was previously possible at scale. Content analysis was originally restricted to predefined keywords and phrases that captured the “manifest meaning of a textual corpus” (2). Other kinds of meaning, implicit in “complexities of phrasing” or “rhetorical forms,” had to be discarded to make text usable as data. But according to the authors, computational approaches to text analysis “give us the ability to instead consider a textual corpus in its full hermeneutic complexity,” going beyond the level of interpretation Kenneth Burke called “semantic” to one he considered “poetic” (3-4). This may be interpretation on a larger scale than literary scholars are accustomed to, but from the social-scientific side of the border, it looks like a move in our direction.

Jari Schroderus, “Through the Looking Glass,” 2006, CC BY-NC-ND 2.0.
The essay I contributed to BDS tells a mirror image of this story. I think twentieth-century literary scholars were largely right to ignore quantitative methods. The problems that interested us weren’t easy to represent, for exactly the reason Mohr, Wagner-Pacifici, and Breiger note: the latent complexities of a text had to be discarded in order to treat it as structured data.

But that’s changing. We can pour loosely structured qualitative data into statistical models these days, and that advance basically blurs the boundary we have taken for granted between the quantitative social sciences and humanities. We can create statistical models now where loosely structured texts sit on one side of an equals sign, and evidence about social identity, prestige, or power sits on the other side.

For me, the point of that sort of model is to get beyond one of the frustrating limitations of “humanities computing,” which was that it tended to stall out at the level of linguistic detail. Before we could pose questions about literary form or social conflict, we believed we had to first agree on a stopword list, and a set of features, and a coding scheme, and … in short, if social questions can only be addressed after you solve all the linguistic ones, you never get to any social questions.

But (as I explain at more length in the essay) new approaches to statistical modeling are less finicky about linguistic detail than they used to be. Instead of fretting endlessly about feature selection and xml tags, we can move on to the social questions we want to pose — questions about literary prestige, or genre, or class, or race, or gender. Text can become to some extent a space where we trace social boundaries and study the relations between them.
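In code, a model like that can be startlingly compact. Here is a toy sketch with text on one side of the equals sign and a social category on the other; the six snippets and the made-up labels (1 = reviewed in elite periodicals, 0 = sampled at random) are hypothetical stand-ins for a real corpus and real metadata.

```python
# A toy supervised model: predict a social category from loosely
# structured text. The data is far too small to mean anything;
# only the shape of the model matters here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "the moon rose over the quiet moor",
    "she considered the letter with growing unease",
    "the detective lit his pipe and waited",
    "blood pooled beneath the warehouse door",
    "a gentle rain fell on the vicarage garden",
    "the gunman vanished into the alley",
]
labels = [1, 1, 0, 0, 1, 0]  # hypothetical: reviewed vs. random

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
print(cross_val_score(model, texts, labels, cv=2).mean())
```

The accuracy number matters less than what the fitted model makes visible: the words that predict a social category become evidence about where the boundary runs.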

In short, the long-standing (and still valuable) connection between digital literary scholarship and linguistics can finally be complemented by equally strong connections to other social sciences. I think those connections are going to have fruitful implications, beginning to become visible in this issue of Big Data and Society, and (just over the horizon) in work in progress sponsored by groups like NovelTM and the Chicago Text Lab.

A final question raised by this interdisciplinary conversation involves the notion of big data foregrounded in the journal title. For social scientists, “big data” has a fairly clear meaning — which has less to do with scale, really, than with new ways of gathering data without surveys. But of course surveys were never central to literary study, and it may be no accident that few of the literary scholars involved in this issue of BDS are stressing the bigness of big data. We’ve got terabytes of literature in digital libraries, and we’re using them. But we’re not necessarily making a fuss about “bigness” as such.

Rachel Buurma’s essay on topic-modeling Trollope’s Barsetshire novels explicitly makes a case for the value of topic-modeling at an intermediate scale — while, by the way, arguing persuasively that a topic model is best understood as an “uncanny, shifting, temporary index,” or “counter-factual map” (4). In my essay I discuss a collection of 720 books. That may sound biggish relative to what literary scholars ordinarily do, but it’s explicitly a sample rather than an attempt at coverage, and I argue against calling it big data.

There are a bunch of reasons for that. I’ve argued in the past that the term doesn’t have a clear meaning for humanists. But my stronger objection is that it distracts readers from more interesting things. It allows us to imagine that recent changes are just being driven by faster computers or bigger disks — and obscures underlying philosophical developments that would fascinate humanists if we knew about them.

I believe the advances that matter for humanists have depended less on sheer scale than on new ideas about what it means to model evidence (i.e., learn from it, generalize from it). Machine learning really is founded on a theory of learning, and it’s kind of tragic that humanists are understanding something that interesting as a purely technical phenomenon called “big data.” I’m not going to try to explain statistical theories of learning in a short blog post, but in my essay I do at least gesture at a classic discussion by Leo Breiman. Some of my observations overlap with an essay in this same issue of BDS by Paul DiMaggio, who is likewise interested in the epistemological premises involved in machine learning.

Categories
18c 19c 20c collection-building discovery strategies machine learning

How to find English-language fiction, poetry, and drama in HathiTrust.

Although methods of analysis are more fun to discuss, the most challenging part of distant reading may still be locating the texts in the first place [1].

In principle, millions of books are available in digital libraries. But literary historians need collections organized by genre, and locating the fiction or poetry in a digital library is not as simple as it sounds. Older books don’t necessarily have genre information attached. (In HathiTrust, less than 40% of English-language fiction published before 1923 is tagged “fiction” in the appropriate MARC control field.)

Volume-level information wouldn’t be enough to guide machine reading in any case, because genres are mixed up inside volumes. For instance Hoyt Long, Richard So, and I recently published an article in Slate arguing (among other things) that references to specific amounts of money become steadily more common in fiction from 1825 to 1950.

Frequency of reference to "specific amounts" of money in 7,700 English-language works of fiction. Graphics from Wickham, ggplot2 [2].
Frequency of reference to “specific amounts” of money in 7,700 English-language works of fiction. Graphics here and throughout from Wickham, ggplot2 [2].

But Google’s “English Fiction” collection tells a very different story. The frequencies of many symbols that appear in prices (dollar signs, sixpence) skyrocket in the late nineteenth century, and then drop back by the early twentieth.

Frequencies of "$" and "6d" in Google's "English Fiction" collection, 1800-1950.
Frequencies of “$” and “6d” in Google’s “English Fiction” collection, 1800-1950.

On the other hand, several other words or symbols that tend to appear in advertisements for books follow a suspiciously similar trajectory.

Frequencies of "$", "8vo" (octavo) and "cloth" in Google's "English Fiction" collection, 1800-1950.
Frequencies of “$”, “8vo” (octavo) and “cloth” in Google’s “English Fiction” collection, 1800-1950.

What we see in Google’s “Fiction” collection is something that happens in volumes of fiction, but not exactly in the genre of fiction — the rise and fall of publishers’ catalogs in the backs of books [3]. Individually, these two- or three-page lists of titles for sale may not look like significant noise, but because they often mention prices, and are distributed unevenly across the timeline, they add up to a significant potential pitfall for anyone interested in the role of money in fiction.

I don’t say this to criticize the team behind the Ngram Viewer. Genre wasn’t central to their goals; they provided a rough “fiction” collection merely as a cherry on top of a massively successful public-humanities project. My point is just that genres fail to line up with volume boundaries in ways that can really matter for the questions scholars want to pose. (In fact, fiction may be the genre that comes closest to lining up with volume boundaries: drama and poetry often appear mixed in The Collected Poems and Plays of So-and-So, With a Prose Life of the Author.)

You can solve this problem by selecting works manually, or by borrowing proprietary collections from a vendor. Those are both good, practical solutions, especially up to (say) 1900. But because they rely on received bibliographies, they may not entirely fulfill the promises we’ve been making about dredging the depths of “the great unread,” boldly going where no one has gone before, etc [4]. Over the past two years, with support from the ACLS and NEH, I’ve been trying to develop another alternative — a way of starting with a whole library, and dividing it by genre at the page level, using machine learning.

In researching the Slate article, we relied on that automatic mapping of genre to select pages of fiction from HathiTrust. It helped us avoid conflating advertisements with fiction, and I hope other scholars will also find that it reduces the labor involved in creating large, genre-specific collections. The point of this blog post is to announce the release of a first version of the map we used (covering 854,476 English-language books in HathiTrust 1700-1922).

The whole dataset is available on Figshare, where it has a DOI and is citable as a publication. An interim report is also available; it addresses theoretical questions about genre, as well as questions about methods and data format. And the code we used for the project is available on Github.

For in-depth answers to questions, please consult the interim project report. It’s 47 pages long; it actually explains the project; this blog post doesn’t. But here are a few quick FAQs just so you can decide whether to read further.

“What categories did you try to separate?”

We identify pages as paratext (front matter, back matter, ads), prose nonfiction, poetry (narrative and lyric are grouped together), drama (including verse drama), or prose fiction. The report discusses the rationale for these choices, but other choices would be possible.

“How accurate is this map?”

Since genres are social institutions, questions about accuracy are relative to human dissensus. Our pairs of human readers agreed about the five categories just mentioned for 94.5% of the pages they tagged [5]. Relying on two-out-of-three voting (among other things), we boiled those varying opinions down to a human consensus, and our model agreed with the consensus 93.6% of the time. So this map is nearly as accurate as we might expect crowdsourcing to be. But it covers 276 million pages. For full details, see the confusion matrices in the report. Also, note that we provide ways of adjusting the tradeoff between recall and precision to fit a researcher’s top priority — which could be catching everything that might belong in a genre, or filtering out everything that doesn’t belong. We provide filtered collections of drama, fiction, and poetry for scholars who want to work with datasets that are 97-98% precise.
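For readers who want the flavor of the consensus step: at its core it is simple majority voting, though the actual procedure (described in the report, and complicated by word-level weighting) was more involved. A toy sketch:

```python
# A toy sketch of two-out-of-three voting over page tags. The real
# consensus procedure was more complicated; see the interim report.
from collections import Counter

def consensus(tags):
    """Return the majority tag among three readers, or None
    when no tag gets at least two votes."""
    tag, count = Counter(tags).most_common(1)[0]
    return tag if count >= 2 else None

pages = [("fiction", "fiction", "paratext"),
         ("poetry", "poetry", "poetry"),
         ("drama", "fiction", "poetry")]
print([consensus(p) for p in pages])
# ['fiction', 'poetry', None]  (the last page has no consensus)
```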

“You just wrote a blog post admitting that even simple generic boundaries like fiction/nonfiction are blurry and contested. So how can we pretend to stabilize a single map of genre?”

The short answer: we can’t. I don’t expect the genre predictions in this dataset to be more than one resource among many. We’ve also designed this dataset to have a certain amount of flexibility. There are confidence metrics associated with each volume, and users can define their collection of, say, poetry more broadly or narrowly by adjusting the confidence thresholds for inclusion. So even this dataset is not really a single map.

“What about divisions below the page level?”

With the exception of divisions between running headers and body text, we don’t address them. There is certainly a wide range of divisions below the page level that can matter, but we didn’t feel there was much to be gained by trying to solve all those problems at the same time as page-level mapping. In many cases, divisions below the page level are logically a subsequent step.

“How would I actually use this map to find stuff?”

There are three different ways — see “How to use this data?” in the interim report. If you’re working with HathiTrust Research Center, you could use this data to define a workset in their portal. Alternatively, if your research question can be answered with word frequencies, you could download public page-level features from HTRC and align them with our genre predictions on your own machine to produce a dataset of word counts from “only pages that have a 97% probability of being prose fiction,” or what have you. (HTRC hasn’t released feature counts for all the volumes we mapped yet, but they’re about to.) You can also align our predictions directly with HathiTrust zip files, if you have those. The pagealigner module in the utilities subfolder of our Github repo is intended as a handy shortcut for people who use Python; it will work both with HT zip files and HTRC feature files, aligning them with our genre predictions and returning a list of pages zipped with genre codes.
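To give a concrete (and purely hypothetical) picture of that last option: the sketch below is not the pagealigner API, just the kind of filtering the data supports, with made-up field names and the 97% threshold mentioned above.

```python
# Hypothetical sketch: keep only pages confidently predicted as fiction.
# Field names, data shapes, and the 0.97 threshold are illustrative;
# see the interim report and the pagealigner module for real formats.
def fiction_pages(pages, predictions, threshold=0.97):
    """pages: page texts for one volume;
    predictions: parallel (genre, probability) pairs."""
    for text, (genre, prob) in zip(pages, predictions):
        if genre == "fiction" and prob >= threshold:
            yield text

volume = ["PUBLISHER'S CATALOGUE ...", "Chapter I. It was a dark ...", "..."]
preds = [("paratext", 0.99), ("fiction", 0.98), ("fiction", 0.62)]
print(list(fiction_pages(volume, preds)))  # only the confident fiction page
```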

“Is this sort of collection really what I need for my project?”

Maybe not. There are a lot of books in HathiTrust. But as I admitted in my last post, a medium-sized collection based on bibliographies may be a better starting point for most scholars. Library-based collections include things like reprints, works in translation, juvenile fiction, and so on, that could be viewed as giving a fuller picture of literary culture … or could be viewed as messy complicating factors. I don’t mean to advocate for a library-based approach; I’m just trying to expand the range of alternatives we have available.

“What if I want to find fiction in French books between 1900 and 1970?”

Although we’ve made our code available as a resource, we definitely don’t want to represent it as a “tool” that could simply be pointed at other collections to do the same kind of genre mapping. Much of the work involved in this process is domain-specific (for instance, you have to develop page-level training data in a particular language and period). So this is better characterized as a method than a tool, and the report is probably more important than the repo. I plan to continue expanding the English-language map into the twentieth century (algorithmic mapping of genre may in fact be especially necessary for distant reading behind the veil of copyright). But I don’t personally have plans to expand this map to other languages; I hope someone else will take up that task.

As a reward for reading this far, here’s a visualization of the relative sizes of genres across time, represented as a percentage of pages in the English-language portion of HathiTrust.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren’t represented here. Results have been smoothed with a five-year moving average.

The image is discussed at more length in the interim progress report.

Acknowledgments

The blog post above often slips awkwardly into first-person plural, because I’m describing a project that involved a lot of people. Parts of the code involved were written by Michael L. Black and Boris Capitanu. The code also draws on machine learning libraries in Weka and Scikit-Learn [6, 7]. Shawn Ballard organized the process of gathering training data, assisted by Jonathan Cheng, Nicole Moore, Clara Mount, and Lea Potter. The project also depended on collaboration and conversation with a wide range of people at HathiTrust Digital Library, HathiTrust Research Center, and the University of Illinois Library, including but not limited to Loretta Auvil, Timothy Cole, Stephen Downie, Colleen Fallaw, Harriett Green, Myung-Ja Han, Jacob Jett, and Jeremy York. Jana Diesner and David Bamman offered useful advice about machine learning. Essential material support was provided by a Digital Humanities Start-Up Grant from the National Endowment for the Humanities and a Digital Innovation Fellowship from the American Council of Learned Societies. None of these people or agencies should be held responsible for mistakes.

References

[1] Perhaps it goes without saying, since the phrase has now lost its quotation marks, but “distant reading” is Franco Moretti, “Conjectures on World Literature,” New Left Review 1 (2000).

[2] Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (New York: Springer, 2009), http://had.co.nz/ggplot2/book.

[3] Having mapped advertisements in volumes of fiction, I’m pretty certain that they’re responsible for the spike in dollar signs in Google’s “English Fiction” collection. The collection I mapped overlaps heavily with Google Books, and the number of pages of ads in fiction volumes tracks very closely with the frequency of dollar signs, “8vo,” and so on.

Percentage of pages in mostly-fiction volumes that are ads. Based on a filtered collection of 102,349 mostly-fiction volumes selected from a larger group of 854,476 volumes 1700-1922. Five-year moving average.

[4] “The great unread” comes from Margaret Cohen, The Sentimental Education of the Novel (Princeton NJ: Princeton University Press, 1999), 23.

[5] See the interim report (subsection, “Evaluating Confusion Matrices”) for a fuller description; it gets complicated, because we actually assessed accuracy in terms of the number of words misclassified, although the classification was taking place at a page level.

[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

Categories
disciplinary history interpretive theory machine learning methodology

Measurement and modeling.

If the Internet is good for anything, it’s good for speeding up the Ent-like conversation between articles, to make that rumble more perceptible by human ears. I thought I might help the process along by summarizing the Stanford Literary Lab’s latest pamphlet — a single-authored piece by Franco Moretti, “‘Operationalizing’: or the function of measurement in modern literary theory.”

One of the many strengths of Moretti’s writing is a willingness to dramatize his own learning process. This pamphlet situates itself as a twist in the ongoing evolution of “computational criticism,” a turn from literary history to literary theory.

Measurement as a challenge to literary theory, one could say, echoing a famous essay by Hans Robert Jauss. This is not what I expected from the encounter of computation and criticism; I assumed, like so many others, that the new approach would change the history, rather than the theory of literature ….

Measurement challenges literary theory because it asks us to “operationalize” existing critical concepts — to say, for instance, exactly how we know that one character occupies more “space” in a work than another. Are we talking simply about the number of words they speak? or perhaps about their degree of interaction with other characters?

Moretti uses Alex Woloch’s concept of “character-space” as a specific example of what it means to operationalize a concept, but he’s more interested in exploring the broader epistemological question of what we gain by operationalizing things. When literary scholars discuss quantification, we often tacitly assume that measurement itself is on trial. We ask ourselves whether measurement is an adequate proxy for our existing critical concepts. Can mere numbers capture the ineffable nuances those concepts are assumed to possess? Here, Moretti flips that assumption and suggests that measurement may have something to teach us about our concepts — as we’re forced to make them concrete, we may discover that we understood them imperfectly. At the end of the article, he suggests for instance (after begging divine forgiveness) that Hegel may have been wrong about “tragic collision.”
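Operationalizations of this kind can be disarmingly literal. If character-space is simply the share of a text’s words each character speaks, a few lines of code measure it; the speaker-prefixed input format below is an assumption for illustration, not a claim about how Woloch or Moretti actually proceeded.

```python
# A toy operationalization of "character-space": each character's
# share of the words spoken in a play. Input format is assumed.
from collections import Counter

def character_space(lines):
    """lines: iterable of (speaker, speech) pairs."""
    words = Counter()
    for speaker, speech in lines:
        words[speaker] += len(speech.split())
    total = sum(words.values())
    return {speaker: n / total for speaker, n in words.items()}

toy_play = [
    ("HAMLET", "To be, or not to be, that is the question."),
    ("OPHELIA", "Good my lord, how does your honour for this many a day?"),
    ("HAMLET", "I humbly thank you; well, well, well."),
]
print(character_space(toy_play))
```

The interesting questions begin exactly where this sketch ends: should stage directions count? narration about a character? degree of interaction with other characters?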

I think Moretti is frankly right about the broad question this pamphlet opens. If we engage quantitative methods seriously, they’re not going to remain confined to empirical observations about the history of predefined critical concepts. Quantification is going to push back against the concepts themselves, and spill over into theoretical debate. I warned y’all back in August that literary theory was “about to get interesting again,” and this is very much what I had in mind.

At this point in a scholarly review, the standard procedure is to point out that a work nevertheless possesses “oversights.” (Insight, meet blindness!) But I don’t think Moretti is actually blind to any of the reflections I add below. We have differences of rhetorical emphasis, which is not the same thing.

For instance, Moretti does acknowledge that trying to operationalize concepts could cause them to dissolve in our hands, if they’re revealed as unstable or badly framed (see his response to Bridgman on pp. 9-10). But he chooses to focus on a case where this doesn’t happen. Hegel’s concept of “tragic collision” holds together, on his account; we just learn something new about it.

In most of the quantitative projects I’m pursuing, this has not been my experience. For instance, in developing statistical models of genre, the first thing I learned was that critics use the word genre to cover a range of different kinds of categories, with different degrees of coherence and historical volatility. Instead of coming up with a single way to operationalize genre, I’m going to end up producing several different mapping strategies that address patterns on different scales.

Something similar might be true even about a concept like “character.” In Vladimir Propp’s Morphology of the Folktale, for instance, characters are reduced to plot functions. Characters don’t have to be people or have agency: when the hero plucks a magic apple from a tree, the tree itself occupies the role of “donor.” On Propp’s account, it would be meaningless to represent a tale like “Le Petit Chaperon Rouge” as a social network. Our desire to imagine narrative as a network of interactions between imagined “people” (wolf ⇌ grandmother) presupposes a separation between nodes and edges that makes no sense for Propp. But this doesn’t necessarily mean that Moretti is wrong to represent Hamlet as a social network: Hamlet is not Red Riding Hood, and tragic drama arguably envisions character in a different way. In short, one of the things we might learn by operationalizing the term “character” is that the term has genuinely different meanings in different genres, obscured for us by the mere continuity of a verbal sign. [I should probably be citing Tzvetan Todorov here, The Poetics of Prose, chapter 5.]

Illustration from "Learning Latent Personas of Film Characters," Bamman et. al.
Illustration from “Learning Latent Personas of Film Characters,” Bamman et. al.
Another place where I’d mark a difference of emphasis from Moretti involves the tension, named in my title, between “measurement” and “modeling.” Moretti acknowledges that there are people (like Graham Sack) who assume that character-space can’t be measured directly, and therefore look for “proxy variables.” But concepts that can’t be directly measured raise a set of issues that are quite a bit more challenging than the concept of a “proxy” might imply. Sack is actually trying to build models that postulate relations between measurements. Digital humanists are probably most familiar with modeling in the guise of topic modeling, a way of mapping discourse by postulating latent variables called “topics” that can’t be directly observed. But modeling is a flexible heuristic that could be used in a lot of different ways.

The illustration on the right is a probabilistic graphical model drawn from a paper on the “Latent Personas of Film Characters” by Bamman, O’Connor, and Smith. The model represents a network of conditional relationships between variables. Some of those variables can be observed (like the words in a plot summary, w, and external information about the film being summarized, md), but some have to be inferred, like recurring character types (p) that are hypothesized to structure film narrative.

Having empirically observed the effects of illustrations like this on literary scholars, I can report that they produce deep, Lovecraftian horror. Nothing looks bristlier and more positivist than plate notation.

But I think this is a tragic miscommunication produced by language barriers that both sides need to overcome. The point of model-building is actually to address the reservations and nuances that humanists correctly want to interject whenever the concept of “measurement” comes up. Many concepts can’t be directly measured. In fact, many of our critical concepts are only provisional hypotheses about unseen categories that might (or might not) structure literary discourse. Before we can attempt to operationalize those categories, we need to make underlying assumptions explicit. That’s precisely what a model allows us to do.

It’s probably going to turn out that many things are simply beyond our power to model: ideology and social change, for instance, are very important and not at all easy to model quantitatively. But I think Moretti is absolutely right that literary scholars have a lot to gain by trying to operationalize basic concepts like genre and character. In some cases we may be able to do that by direct measurement; in other cases it may require model-building. In some cases we may come away from the enterprise with a better definition of existing concepts; in other cases those concepts may dissolve in our hands, revealed as more unstable than even poststructuralists imagined. The only thing I would say confidently about this project is that it promises to be interesting.