
Science fiction hasn’t prepared us to imagine machine learning.

Science fiction did a great job preparing us for submarines and rockets. But it seems to be struggling lately. We don’t know what to hope for, what to fear, or what genre we’re even in.

Space opera? Seems unlikely. And now that we’ve made it to 2021, the threat of zombie apocalypse is receding a bit. So it’s probably some kind of cyberpunk. But there are many kinds of cyberpunk. Should we get ready to fight AI or to rescue replicants from a sinister corporation? It hasn’t been obvious. I’m writing this, however, because recent twists in the plot seem to clear up certain mysteries, and I think it’s now possible to guess which subgenre the 2020s are steering toward.

Clearly some plot twist involving machine learning is underway. It’s been hard to keep up with new developments: from BERT (2018) to GPT-3 (2020)—which can turn a prompt into an imaginary news story—to, most recently, CLIP and DALL-E (2021), which can translate verbal descriptions into images.

Output from DALL-E. If you prefer, you can have a baby daikon radish in a tutu walking a dog.

I have limited access to DALL-E, and can’t test it in any detail. But if we trust the images released by OpenAI, the model is good at fusing and extrapolating abstractions: it not only knows what it means for a lemur to hold an umbrella, but can produce a surprisingly plausible “photo of a television from the 1910s.” All of this is impressive for a research direction that isn’t much more than four years old.

The prompt here is “a photo of a television from the …<fill in the decade>”

On the other hand, some AI researchers don’t believe these models are taking the field in the direction it was supposed to go. Gary Marcus and Ernest Davis, for instance, doubt that GPT-3 is “an important step toward artificial general intelligence—the kind that would … reason broadly in a manner similar to humans … [GPT-3] learns correlations between words, and nothing more.”

People who want to contest that claim can certainly find evidence on the other side of the question. I’m not interested in pursuing the argument here. I just want to know why recent advances in deep learning give me a shivery sense that I’ve crossed over into an unfamiliar genre. So let’s approach the question from the other side: what if these models are significant because they don’t reason “in a manner similar to humans”?

It is true, after all, that models like DALL-E and GPT-3 are only learning (complex, general) patterns of association between symbols. When GPT-3 generates a sentence, it is not expressing an intention or an opinion—just making an inference about the probability of one sentence in a vast “latent space” of possible sentences implied by its training data.

When I say “a vast latent space,” I mean really vast. This space includes, for instance, the thoughts Jerome K. Jerome might have expressed about Twitter if he had lived in our century.

Mario Klingemann gets GPT-3 to extrapolate from a title and a byline.

But a latent space, however vast, is still quite different from goal-driven problem solving. In a sense the chimpanzee below is doing something more like human reasoning than a language model can.

Primates, understandably, envision models of the world as things individuals create in order to reach bananas. (Ultimately from Wolfgang Köhler, The Mentality of Apes, 1925.)

Like us, the chimpanzee has desires and goals, and can make plans to achieve them. A language model does none of that by itself—which is probably why language models are impressive at the paragraph scale but tend to wander if you let them run for pages.

So where does that leave us? We could shrug off the buzz about deep learning, say “it’s not even as smart as a chimpanzee yet,” and relax because we’re presumably still living in a realist novel.

And yes, to be sure, deep learning is in its infancy and will be improved by modeling larger-scale patterns. On the other hand, it would be foolish to ignore early clues about what it’s good for. There is something bizarrely parochial about a view of mental life that makes predicting a nineteenth-century writer’s thoughts about Twitter less interesting than stacking boxes to reach bananas. Perhaps it’s a mistake to assume that advances in machine learning are only interesting when they resemble our own (supposedly “general”) intelligence. What if intelligence itself is overrated?

The collective symbolic system we call “culture,” for instance, coordinates human endeavors without being itself intelligent. What if models of the world (including models of language and culture) are important in their own right—and needn’t be understood as attempts to reproduce the problem-solving behavior of individual primates? After all, people are already very good at having desires and making plans. We don’t especially need a system that will do those things for us. But we’re not great at imagining the latent space of (say) all protein structures that can be created by folding amino acids. We could use a collaborator there.

Storytelling seems to be another place where human beings sense a vast space of latent possibility, and tend to welcome collaborators with maps. Look at what’s happening to interactive fiction on sites like AI Dungeon. Tens of thousands of users are already making up stories interactively with GPT-3. There’s a subreddit devoted to the phenomenon. Competitors are starting to enter the field. One startup, Hidden Door, is trying to use machine learning to create a safe social storytelling space for children. For a summary of what collaborative play can build, we could do worse than their motto: “Worlds with Friends.”

It’s not hard to see how the “social play” model proposed by Hidden Door could eventually support the form of storytelling that grown-ups call fan fiction. Characters or settings developed by one author might be borrowed by others. Add something like DALL-E, and writers could produce illustrations for their story in a variety of styles—from Arthur Rackham to graphic novel.

Will a language model ever be as good as a human author? Can it ever be genuinely original? I don’t know, and I suspect those are the wrong questions. Storytelling has never been a solitary activity undertaken by geniuses who invent everything from scratch. From its origin in folk tales, fiction has been a game that works by rearranging familiar moves, and riffing on established expectations. Machine learning is only going to make the process more interactive, by increasing the number of people (and other agents) involved in creating and exploring fictional worlds. The point will not be to replace human authors, but to make the universe of stories bigger and more interconnected.

Storytelling and protein folding are two early examples of domains where models will matter not because they’re “intelligent,” but because they allow us—their creators—to collaboratively explore a latent space of possibility. But I will be surprised if these are the only two places where that pattern emerges. Music and art, and other kinds of science, are probably open to the same kind of exploration.

This collaborative future could be weirder than either science fiction or journalism has taught us to expect. News stories about ML invariably invite readers to imagine autonomous agents analogous to robots: either helpful servants or inscrutable antagonists like the Terminator and HAL. Boring paternal condescension or boring dread are the only reactions that seem possible within this script.

We need to be considering a wider range of emotions. Maybe a few decades from now, autonomous AI will be a reality and we’ll have to worry whether it’s servile or inscrutable. Maybe? But that’s not the genre we’re in at the moment. Machine learning is already transforming our world, but the things that should excite and terrify us about the next decade are not even loosely analogous to robots. We should be thinking instead about J. L. Borges’ Library of Babel—a vast labyrinth containing an infinite number of books no eye has ever read. There are whole alternate worlds on those shelves, but the Library is not a robot, an alien, or a god. It is just an extrapolation of human culture.

Érik Desmazières, “The Library of Babel.”

Machine learning is going to be, let’s say, a thread leading us through this Library—or perhaps a door that can take us to any bookshelf we imagine. So if the 2020s are a subgenre of SF, I would personally predict a mashup of cyberpunk and portal fantasy. With sinister corporations, of course. But also more wardrobes, hidden doors, encyclopedias of Tlön, etc., than we’ve been led to expect in futuristic fiction.

I’m not saying this will be a good thing! Human culture itself is not always a good thing, and extrapolating it can take you places you don’t want to go. For instance, movements like QAnon make clear that human beings are only too eager to invent parallel worlds. Armored with endlessly creative deepfakes, those worlds might become almost impenetrable. So we’re probably right to fear the next decade. But let’s point our fears in a useful direction, because we have more interesting things to worry about than a servant who refuses to “open the pod bay doors.” We are about to be in a Borges story, or maybe, optimistically, the sort of portal fantasy where heroines create doors with a piece of chalk and a few well-chosen words. I have no idea how our version of that story ends, but I would put a lot of money on “not boring.”


How predictable is fiction?

This blog post is loosely connected to a talk I’m giving (virtually) at the Workshop on Narrative Understanding, Storylines, and Events at the ACL. It’s an informal talk, exploring some of the challenges and opportunities we encounter when we take the impressive sentence-level tools of contemporary NLP and try to use them to produce insights about book-length documents.

Questions about the “predictability” of fiction started to interest me after I read a preprint by Maarten Sap et al. on the difference between “recollected” and “imagined” stories. There’s a lot in the paper, but the thing that especially caught my eye was that a neural language model (GPT) does better predicting the next sentence in imagined stories than in recollected stories about biographical events. The authors persuasively interpret this as a sign that imagined stories have been streamlined by a process of “narrativization.”

The stories in that article are very short narratives made up (or recalled) by experimental subjects. But, given my background in literary history, I wondered whether the same contrast might appear between book-length works of fiction and biography. Are fictional narratives in some sense more predictable than nonfiction?

One could say we already know the answer. Fiction is governed by plot conventions, so of course it makes sense that it’s predictable! But an equally intuitive argument could be made that fiction entertains readers by baffling and eluding their expectations about what, specifically, will happen next. Perhaps it ought to be less predictable than nonfiction? In short, there are basic questions about fiction that don’t have clear general answers yet, although we’re getting better at framing the questions. (See e.g. Caroline Levine on The Serious Pleasures of Suspense, Vera Tobin on Elements of Surprise, or Andrew Piper’s chapters on “Plot” and “Fictionality” in Enumerations.)

Plus, even if it were intuitively obvious that fiction is more strongly governed by plot conventions than by surprise, it might be interesting to measure the strength of those conventions in particular works. If we could do that, we’d have new evidence for a host of familiar debates about tradition and innovation.

So, how to do it? Sap et al. measure “narrative flow” by using a neural language model that can judge whether a sentence is likely to occur in a given context. It’s a good strategy for paragraph or page-sized stories, but I suspect sentences may be too small to capture the things we would call “predictable plot patterns” in novels. However, it wasn’t hard to give this strategy a spin, so I did, using a language model called BERT to assess pairs of sentences from 32 biographies and 32 novels. (This is just a toy-sized sample for a semi-thought-experiment; I’m not pretending to finally resolve anything.) At each step, in each book, I asked BERT to judge the probability that sentence B would really follow sentence A. (The code I used is in a GitHub repo.)
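
For readers who want to try something similar, here is a minimal sketch of the sentence-pair scoring described above. It is not the code from the repo. It assumes the Hugging Face `transformers` package and PyTorch are installed, uses the `bert-base-uncased` checkpoint as a stand-in, and the function names are invented for illustration.

```python
def consecutive_pairs(sentences):
    """Return (A, B) pairs of adjacent sentences, in reading order."""
    return list(zip(sentences, sentences[1:]))


def next_sentence_probabilities(sentences, model_name="bert-base-uncased"):
    """Score each adjacent pair with BERT's next-sentence-prediction head.

    Assumption: the Hugging Face `transformers` package and PyTorch are
    installed. The imports are local so that the pairing helper above can
    be used without them.
    """
    import torch
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForNextSentencePrediction.from_pretrained(model_name)
    model.eval()

    probabilities = []
    with torch.no_grad():
        for sent_a, sent_b in consecutive_pairs(sentences):
            encoded = tokenizer(sent_a, sent_b, return_tensors="pt",
                                truncation=True, max_length=512)
            logits = model(**encoded).logits
            # Index 0 is the "B continues A" class for this head.
            prob = torch.softmax(logits, dim=-1)[0, 0].item()
            probabilities.append(prob)
    return probabilities
```

Calling `next_sentence_probabilities` on a list of sentences from one book gives one probability per adjacent pair. Averaging those values per book is one plausible way to compare genres, though it inherits the sentence-length bias discussed below.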

The result I got was the opposite of the one reported in Sap et al. There is a statistically significant difference between biography and fiction, but the pairs of sentences in biography appeared more predictable—more likely to follow each other—than the sentences in fiction. I hasten to say, however, that this could be wrong in several ways. First, BERT’s perception that two sentences are likely to follow each other correlates strongly with the length of the sentences. Short sentences (like most sentences in dialogue) seem less clearly connected. Since there’s a lot of dialogue in published fiction, BERT might be, in effect, biased against fiction.

Fig. 1. Two different ways of measuring continuity between some sample sentences.

More importantly, sentence-level continuity isn’t necessarily a good measure of surprise in novel-length works. For instance, in fig. 1, you’ll notice that BERT is unruffled when Pride and Prejudice morphs into Flatland. As long as each sentence picks up some discursive cue from the one before, BERT perceives the pairs as plausibly connected. But by the fourth sentence in the chain, Mr Bennet is listening to a lecture from a translucent, blue, four-dimensional being in his sitting room. Human readers would probably be surprised if this happened.

There are ways to generate “sentence embeddings” that might correspond more closely to human surprise. (This is a crowded field, but see for instance Sentence-BERT, Reimers and Gurevych 2019.) Even primitive 2014-era GloVe embeddings do a somewhat better job (Pennington, Socher, and Manning 2014). By averaging the GloVe embeddings for all the words in a sentence, we can represent each sentence as a vector of length 300. Then we can measure the cosine distances between sentences, as I’ve done in the third column of Fig. 1. (Here, large numbers indicate a big gap between sentences; it’s the reverse of the “probability” measure provided by BERT, where high numbers represent continuity.) This model of distance is (appropriately) more surprised by the humming blue sphere in row three than by the short sentence of dialogue in row five.
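
The averaging-and-cosine-distance idea can be sketched in a few lines. The three-dimensional vectors below are invented for illustration; real GloVe vectors would have 300 dimensions and be loaded from a file, and the function names are hypothetical.

```python
import math

# Toy 3-dimensional "embeddings", made up for this example only.
TOY_VECTORS = {
    "mr": [0.9, 0.1, 0.0],
    "bennet": [0.8, 0.2, 0.1],
    "spoke": [0.7, 0.3, 0.0],
    "blue": [0.0, 0.2, 0.9],
    "sphere": [0.1, 0.1, 0.8],
    "hummed": [0.2, 0.3, 0.7],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words in a sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in sentence")
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]


def cosine_distance(u, v):
    """1 minus cosine similarity: near 0 for similar directions, larger for a bigger gap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


a = sentence_vector("Mr Bennet spoke", TOY_VECTORS)
b = sentence_vector("Bennet spoke", TOY_VECTORS)
c = sentence_vector("blue sphere hummed", TOY_VECTORS)

near = cosine_distance(a, b)  # overlapping vocabulary, small distance
far = cosine_distance(a, c)   # unrelated vocabulary, larger distance
```

A plain average ignores word order and gives every word equal weight. Weighted schemes such as the one in Arora et al. (listed in the references) are a common refinement.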

But even if we had a good measure of continuity, sentences might just be too small to capture the patterns that count as “predictability” in a novel. As the example in fig. 1 suggests, a sequence of short steps, individually unsurprising, can leave the reader in a world very different from the place they started. Continuity of this kind is not the “predictability” we would want to measure at book scale.

When readers talk about predictable or unpredictable stories, they’re probably thinking about specific problem situations and possible outcomes. Will the protagonist marry suitor A or suitor B? Can we guess? It may soon be possible to automatically extract implicit questions of this kind from fiction. And the Story Cloze task (Mostafazadeh et al.) showed that it’s possible to answer “what happens next” at paragraph scale. But right now I don’t know how to extract implicit questions, or answer them, at the scale of a novel. So let’s try a simpler—in fact minimal—predictive task. Given two passages selected at random from a book, can we predict which came first? Doing that won’t tell us anything about plot—if “plot” is a causal connection between events. But it will tell us whether book-length works are organized by any predictable large-scale patterns. (As we’ll see in a moment, this is a real question, and in some genres the answer might be “not really.”)

The vector-space representation we developed in the third column of Fig. 1 can be scaled up for this question. “Paragraphs” and “chapters” mean different things in different periods, so for now, it may be better simply to divide stories into arbitrary thousand-word passages. Each passage will be represented as a vector by averaging the GloVe embeddings for the words in it; we’ll subtract one passage from the other and use the difference to decide whether A came before B in the book, or vice-versa.
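
To make the setup concrete, here is a rough sketch of the before-or-after task on synthetic data. Everything in it is an assumption for illustration: the "books" are random vectors with one dimension that drifts over narrative time, the classifier is a hand-rolled L2-regularized logistic regression trained by gradient descent (not necessarily the solver used for the figures), and the hyperparameters are arbitrary.

```python
import math
import random


def difference_features(vec_a, vec_b):
    """Feature vector for a pair: passage A minus passage B."""
    return [a - b for a, b in zip(vec_a, vec_b)]


def make_pairs(book, n_pairs, rng):
    """Sample passage pairs from one book; label 1 if A comes first."""
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(book)), 2)
        pairs.append((difference_features(book[i], book[j]), 1 if i < j else 0))
    return pairs


def train_logistic(pairs, l2=0.01, rate=0.5, epochs=200):
    """Batch gradient descent on an L2-regularized logistic loss.

    No intercept is fitted: swapping A and B negates the features and
    flips the label, so the problem is symmetric around zero.
    """
    dims = len(pairs[0][0])
    weights = [0.0] * dims
    for _ in range(epochs):
        grad = [l2 * w for w in weights]
        for x, y in pairs:
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for k in range(dims):
                grad[k] += (p - y) * x[k] / len(pairs)
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights


def accuracy(weights, pairs):
    """Share of pairs whose order the model guesses correctly."""
    correct = 0
    for x, y in pairs:
        z = sum(w * xi for w, xi in zip(weights, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(pairs)


def synthetic_book(n_passages, dims, rng):
    """Fake passage vectors: dimension 0 drifts upward, the rest is noise."""
    book = []
    for t in range(n_passages):
        vec = [rng.gauss(0, 1) for _ in range(dims)]
        vec[0] += 4.0 * t / n_passages
        book.append(vec)
    return book


rng = random.Random(0)
books = [synthetic_book(40, 5, rng) for _ in range(6)]
# Train on all books but the last, then test on the held-out one.
train = [pair for book in books[:-1] for pair in make_pairs(book, 100, rng)]
test = make_pairs(books[-1], 200, rng)
weights = train_logistic(train)
held_out_accuracy = accuracy(weights, test)
```

With real data, each passage vector would be the average of the GloVe embeddings in a thousand-word passage, and the drift would have to be discovered rather than planted. Holding out a whole book, as here, keeps the model from simply memorizing one volume.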

Fig. 2. Accuracy of sequence prediction for randomly selected pairs of passages from detective novels, or novels randomly selected from the whole Chicago Novel Corpus. Regularized logistic regression is trained on 47 volumes and tested on the 48th; the boxplots represent the range of mean accuracies for different held-out volumes.

Random accuracy for this task would be 50%, but a model trained on a reasonable number of novels can easily achieve 65-66%, especially if the novels are all in the same genre. That number may not sound impressive, but I suspect it’s not much worse than human accuracy would be—if a human reader were asked to draw the arrow of time connecting two random passages from an unfamiliar book.

In fact, why is it possible to do this at all? Since the two passages may be separated by a hundred-odd pages, our model clearly isn’t registering any logical relationships between events. Instead, it’s probably relying on patterns described in previous work by David McClure and Scott Enderle. McClure and Enderle have shown that there are strong linguistic gradients across narrative time in fiction. References to witnesses, guilt, and jail, for instance, tend to occur toward the end of a book (if they occur at all).

Fig. 3. David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Our model may draw even stronger clues from simple shifts of rhetorical perspective like the one in figure 4: indefinite articles appear early in a book, when “a mysterious old man” enters “a room.” A few pages later, he will either acquire a name or become “the old man” in “the room.”

Fig. 4. David William McClure and Scott Enderle, “Distribution of Function Words Across Narrative Time in 50,000 Novels,” ADHO 2018.

We probably wouldn’t call that shift of perspective “plot.” On the other hand, before we dismiss these gradients as merely linguistic rather than narrative phenomena, it’s worth noting that they seem to be specific to fiction. When I try to use the same general strategy to predict the direction of time between pairs of passages in biographies, the model struggles to do better than random guessing. Even with the small toy sample I’m using below (32 novels and 32 biographies), there is clearly a significant difference between the two genres. So, although BERT may not see it, fictional narratives are more predictable than nonfiction ones when we back out to look at the gradient of time across a whole book. There is a much clearer difference between before and after in fiction.

Fig. 5. Range of accuracies for a regularized logistic regression model trained to identify the earlier of two 1000-word passages.

“A predictable difference between before and after” is something a good bit cruder than we ordinarily mean by “plot.” But the fact that this difference is specific to fiction makes me think that a model of this kind may after all confirm some part of what we meant in speculating “fictional plots are shaped by conventions that make them more predictable than nonfiction.”

Of course, to really understand plot, we will need to pair these loose book-sized arcs with a more detailed understanding of the way characters’ actions are connected as we move from one page to the next. For that kind of work, I invite you to survey the actual papers accepted for the Workshop on Narrative Understanding <gestures at the program>, which are advancing the state of the art, for instance, on event extraction.

But I can’t resist pointing out that even the crude vector-space model I have played with here can give us some leverage on page-level surprise, and in doing so, complicate the story I’ve just told. One odd detail I’ve noticed is that the predictability of a narrative at book scale (measured as our ability to predict the direction of time between two widely separated passages) correlates with a kind of unpredictability as we move from one sentence, page, or thousand-word passage to the next.

For instance, one way to describe the stability of a sequence is to measure “autocorrelation.” If we shift a time series relative to itself, moving it back by one step, how much does the original series correlate with the lagged version?

Fig. 6. These are wholly imaginary curves to illustrate an idea.

A process with a lot of inertia (e.g., change in temperature across a year) might still have the same basic shape if we shift it backward eight hours. The amount of sunlight in Seattle, on the other hand, fluctuates daily and will be largely out of phase with itself if we shift it backward eight hours; the correlation between those two curves will be pretty low, or even negative as above.
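
The contrast between the two curves can be checked with idealized sine waves. This is a sketch under simplifying assumptions: both curves are pure sinusoids sampled hourly for a year, which real temperature and sunlight are not. For a pure sinusoid, shifting by a fraction f of its period gives a correlation close to cos(2πf), so an eight-hour shift of a daily cycle should land near cos(2π/3) = -0.5.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def shifted_correlation(series, lag):
    """Correlate a series with a copy of itself shifted by `lag` steps."""
    return pearson(series[lag:], series[:-lag])


HOURS = 24 * 365  # one year, sampled hourly

# Idealized curves, invented for illustration: a yearly cycle and a daily one.
yearly = [math.sin(2 * math.pi * h / HOURS) for h in range(HOURS)]
daily = [math.sin(2 * math.pi * h / 24) for h in range(HOURS)]

slow = shifted_correlation(yearly, 8)  # barely changes in eight hours
fast = shifted_correlation(daily, 8)   # a third of a cycle out of phase
```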

Since we’re representing each passage of a book as a vector of 300 numbers, this gives us 300 time series—300 curves—for each volume. It is difficult to say what each curve represents; the individual components of a word embedding don’t come with interpretable labels. But we can measure the narrative’s general degree of inertia by asking how strongly these curves are, collectively, autocorrelated. Crudely: I shift each time series back one step (1000 words) and measure the Pearson correlation coefficient between the lagged and unlagged version. Then I take the mean correlation for all 300 series.*
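
A minimal sketch of that calculation follows, assuming each book is already a list of passage vectors in narrative order. The function names are hypothetical, and the two tiny "books" are made up to show the extremes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lag_one_autocorrelation(series):
    """Correlate a series with itself shifted back by one passage."""
    return pearson(series[1:], series[:-1])


def mean_autocorrelation(passage_vectors):
    """Average the lag-one autocorrelation over every embedding dimension."""
    dims = len(passage_vectors[0])
    curves = [[vec[d] for vec in passage_vectors] for d in range(dims)]
    return sum(lag_one_autocorrelation(curve) for curve in curves) / dims


# Two made-up "books" with two dimensions each.
smooth_book = [[float(t), 2.0 * t] for t in range(10)]                  # steady drift
volatile_book = [[(-1.0) ** t, (-1.0) ** (t + 1)] for t in range(10)]   # flips every passage

smooth_score = mean_autocorrelation(smooth_book)
volatile_score = mean_autocorrelation(volatile_book)
```

A book with 300-dimensional passage vectors would go through the same function unchanged. Treating a constant curve as zero autocorrelation is a choice made here for convenience, not something the post specifies.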

Fig. 7. Relationship between the volatility of the text (low autocorrelation) and accuracy of models that attempt to put two passages in the right order. Although there are more fiction volumes, we keep accuracy comparable by training on only 32 volumes at a time.

The result is unintuitive. You might think it would be easier to predict the direction of narrative time in books where variables change slowly—as temperature does—tracing a reliable arc. But instead it turns out that prediction is more accurate in books where these curves behave a bit like sunlight, fluctuating substantially every 1000 words. (The linear relationship with autocorrelation is r = -.237 in fig. 7, though I suspect the real relationship isn’t linear.) Also, biography appears to be distinguished from fiction by higher autocorrelation (lower volatility).

So yes, fiction is more predictable than nonfiction across the sweep of a whole narrative (because the beginnings and ends of novels are rhetorically very distinct). But the same observation doesn’t necessarily hold as we move from page to page, or sentence to sentence. At that scale, fiction may be more volatile than nonfiction is. I don’t yet know why! We could speculate that this has something to do with an imperative to surprise the reader—but it might also be as simple as the alternation of dialogue and description, which creates a lot of rapid change in the verbal texture of fiction. In short, I’m pointing to a question rather than answering one. There appear to be several different kinds of “predictability” in narrative, and teasing them apart might give us some simple insights into the structural differences between fiction and nonfiction.

* Postscript: Everything above is speculative and exploratory. I’ve shared some code and data in a repository, but I wouldn’t call it fully replicable. There are more sophisticated ways to measure autocorrelation. If any economists read this, it will occur to them that we could also “predict the future course of a story” using full vector autoregression or an ARIMA model. I’ve tried that, but my sense is that the results were actually dominated by the two factors explored separately above (before-and-after predictability and the autocorrelation of individual variables with themselves). Also, to make any of this really illuminate literary history, we will need a bigger and better corpus, allowing us to ask how patterns like this intersect with genre, prestige, and historical change. A group of researchers at Illinois, including Wenyi Shang and Peizhen Wu, are currently pursuing those questions.


Edwin A. Abbott, Flatland: A Romance of Many Dimensions (London: 1884).

Sanjeev Arora, Yingyu Liang, Tengyu Ma, “A Simple but Tough-to-Beat Baseline for Sentence Embeddings.” ICLR 2017.

Jane Austen, Pride and Prejudice (London: Egerton, 1813).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

Caroline Levine, The Serious Pleasures of Suspense (Charlottesville: University of Virginia Press, 2003).

David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, James Allen. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. NAACL 2016.

Shay Palachy, “Document Embedding Techniques: A Review of Notable Literature on the Topic,” Towards Data Science, September 9, 2019.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014.

Andrew Piper, Enumerations (Chicago: University of Chicago Press, 2018).

Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” EMNLP-IJCNLP 2019.

Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, James Pennebaker. Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models. ACL 2020.

Vera Tobin, Elements of Surprise: Our Mental Limits and the Satisfactions of Plot (Cambridge: Harvard University Press, 2018).


Humanists own the fourth dimension, and we should take pride in it.

This is going to be a short, sweet, slightly-basic blog post, because I just have a simple thing to say.

I was originally trained as a scholar of eighteenth- and nineteenth-century British literature. As I learn more about other disciplines, I have been pleased to find that they are just as self-conscious and theoretically reflective as the one where I was trained. Every discipline has its own kind of theory.

But there is one thing that I still believe the humanities do better than any other part of the university: reflecting on historical change and on the historical mutability of the human mind. Lately social scientists (e.g. economic historians or physical anthropologists) can sometimes give us a run for our money. But humanists are more accustomed to the paradoxes that emerge when the rules of the game you’re playing can get historicized and contextualized, and change right under your feet. (We even have a word for it: “hermeneutics.”) So I think we still basically own the dimension of time.

At the moment, we aren’t celebrating that fact very much. Perhaps we’re still reeling from the late-20th-century discovery that the humanities’ connection to the past can be described as “cultural capital.” Ownership of the collective past is something people fight over, and the humanities had a central position in 19th- and 20th-century education partly because they had the social function of distributing that kind of authority.

Wariness about that social function is legitimate and necessary. However, I don’t think it can negate the basic fact that human beings are shaped and guided by a culture we inherit. In a very literal sense we can’t understand ourselves without understanding the past.


I don’t think we can afford to play down this link to the past. At a moment when the humanities feel threatened by technological change, it may be tempting to get rid of anything that looks dusty. Out with seventeenth-century books, in with social media and sublimely complex network diagrams. Instead of identifying with the human past, we increasingly justify our fields of study by talking about “humanistic values.” The argument implicit (and sometimes explicit) in that gesture is that the humanities are distinguished from other disciplines not by what we study, but by studying it in a more critical or ethical way.

Maybe that will work. Maybe the world will decide that it needs us because we are the only people preserving ethical reflection in an otherwise fallen age of technology. But I don’t know. There isn’t a lot of evidence that humanists are actually, on average, more ethical than other people. And even if there were good evidence, “I am more critical and ethical than ye” is the kind of claim that often proves a hard sell.

But that’s a big question, and the jury is out. And anyway the humanities don’t need more negativity at the moment. I mainly want to underline a positive point, which is that historical change is a big deal for hominids. Its importance isn’t declining. We are self-programmed creatures, and it is a basic matter of self-respect to try to understand how and when we got our instructions. Humanists are people who try to understand that story, and we should take pride in our connection to time.




Do humanists need BERT?

This blog began as a space where I could tinker with unfamiliar methods. Lately I’ve had less time to do that, because I was finishing a book. But the book is out now—so, back to tinkering!

There are plenty of new methods to explore, because computational linguistics is advancing at a dizzying pace. In this post, I’m going to ask how historical inquiry might be advanced by Transformer-based models of language (like GPT and BERT). These models are handily beating previous benchmarks for natural language understanding. Will they also change historical conclusions based on text analysis? For instance, could BERT help us add information about word order to quantitative models of literary history that previously relied on word frequency? It is a slightly daunting question, because the new methods are not exactly easy to use.

I don’t claim to fully understand the Transformer architecture, although I get a feeling of understanding when I read this plain-spoken post by “nostalgebraist.” In essence Transformers capture information implicit in word order by allowing every word in a sentence—or in a paragraph—to have a relationship to every other word. For a fuller explanation, see the memorably-titled paper “Attention Is All You Need” (Vaswani et al. 2017). BERT is pre-trained on a massive English-language corpus; it learns by trying to predict missing words and put sentences in the right order (Devlin et al., 2018). This gives the model a generalized familiarity with the syntax and semantics of English. Users can then fine-tune the generic model for specific tasks, like answering questions or classifying documents in a particular domain.
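
As a rough illustration of "every word has a relationship to every other word," here is a stripped-down sketch of the scaled dot-product attention at the core of the Transformer. It is heavily simplified and assumes a lot: one attention head, no learned projections (each word vector serves as its own query, key, and value), no positional information, and made-up two-dimensional word vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product attention with no learned projections.

    A real Transformer learns separate query, key, and value projections
    and adds positional information to its inputs; this sketch omits both.
    """
    dims = len(vectors[0])
    scale = math.sqrt(dims)
    weights = []
    for query in vectors:
        # Compare this word with every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights.append(softmax(scores))
    # Each output is a weighted mix of all the word vectors.
    outputs = [
        [sum(w * vec[d] for w, vec in zip(row, vectors)) for d in range(dims)]
        for row in weights
    ]
    return weights, outputs


# Three made-up 2-dimensional word vectors standing in for a short sentence.
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attention_weights, mixed = self_attention(words)
```

Each row of `attention_weights` says how much one word draws on every other word. Without positional information this mixing ignores word order, which is why real models add position embeddings to the inputs.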

Credit for meme goes to @Rachellescary.

Even if you have no intention of ever using the model, there is something thrilling about BERT’s ability to reuse the knowledge it gained solving one problem to get a head start on lots of other problems. This approach, called “transfer learning,” brings machine learning closer to learning of the human kind. (We don’t, after all, retrain ourselves from infancy every time we learn a new skill.) But there are also downsides to this sophistication. Frankly, BERT is still a pain for non-specialists to use. To fine-tune the model in a reasonable length of time, you need a GPU, and Macs don’t come with the commonly-supported GPUs. Neural models are also hard to interpret. So there is definitely a danger that BERT will seem arcane to humanists. As I said on Twitter, learning to use it is a bit like “memorizing incantations from a leather-bound tome.”

I’m not above the occasional incantation, but I would like to use BERT only where necessary. Communicating to a wide humanistic audience is more important to me than improving a model by 1%. On the other hand, if there are questions where BERT improves our results enough to produce basically new insights, I think I may want a copy of that tome! This post applies BERT to a couple of different problems, in order to sketch a boundary between situations where neural language understanding really helps, and those where it adds little value.

I won’t walk the reader through the whole process of installing and using BERT, because there are other posts that do it better, and because the details of my own workflow are explained in the github repo. But basically, here’s what you need:

1) A computer with a GPU that supports CUDA (a language for talking to the GPU). I don’t have one, so I’m running all of this on the Illinois Campus Cluster, using machines equipped with a Tesla K40M or K80 (I needed the latter to go up to 512-word segments).

2) The PyTorch module of Python, which includes classes that implement BERT, and translate it into CUDA instructions.

3) The BERT model itself (which is downloaded automatically by PyTorch when you need it). I used the base uncased model, because I wanted to start small; there are larger versions.

4) A few short Python scripts that divide your data into BERT-sized chunks (128 to 512 words) and then ask PyTorch to train and evaluate models. The scripts I’m using come ultimately from HuggingFace; I borrowed them via Thilina Rajapakse, because his simpler versions appeared less intimidating than the original code. But I have to admit: in getting these scripts to do everything I wanted to try, I sometimes had to consult the original HuggingFace code and add back the complexity Rajapakse had taken out.
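The chunking step in item 4 can be sketched in a few lines. This is a minimal illustration, not the script from the repo: the function name `chunk_words` is hypothetical, and it counts whitespace-separated words, whereas BERT's own tokenizer splits text into WordPiece tokens, so real segment lengths will differ somewhat.

```python
def chunk_words(text, chunk_size=128):
    """Split text into consecutive segments of at most chunk_size words.

    Assumption: "words" are whitespace-separated tokens. BERT itself
    counts WordPiece tokens, so this is only an approximation.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# A hypothetical 300-word document, split into 128-word segments.
document = "word " * 300
chunks = chunk_words(document, chunk_size=128)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)  # [128, 128, 44]
```

Note that the last segment is shorter than the others; a real workflow has to decide whether to keep, pad, or drop such remainders.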

Overall, this wasn’t terribly painful: getting BERT to work took a couple of days. Dependencies were, of course, the tricky part: you need a version of PyTorch that talks to your version of CUDA. For more details on my workflow (and the code I’m using), you can consult the github repo.

So, how useful is BERT? To start with, let’s consider how it performs on a standard sentiment-analysis task: distinguishing positive and negative opinions in 25,000 movie reviews from IMDb. It takes about thirty minutes to convert the data into BERT format, another thirty to fine-tune BERT on the training data, and a final thirty to evaluate the model on a validation set. The results blow previous benchmarks away. I wrote a casual baseline using logistic regression to make predictions about bags of words; BERT easily outperforms both my model and the more sophisticated model that was offered as state-of-the-art in 2011 by the researchers who developed the IMDb dataset (Maas et al. 2011).

Accuracy on the IMDb dataset from Maas et al.; classes are always balanced; the “best BoW” figure is taken from Maas et al.

I suspect it is possible to get even better performance from BERT. This was a first pass with very basic settings: I used the bert-base-uncased model, divided reviews into segments of 128 words each, ran batches of 24 segments at a time, and ran only a single “epoch” of training. All of those choices could be refined.

Note that even with these relatively short texts (the movie reviews average 234 words long), there is a big difference between accuracy on a single 128-word chunk and on the whole review. Longer texts provide more information, and support more accurate modeling. The bag-of-words model can automatically take full advantage of length, treating the whole review as a single, richly specified entity. BERT is limited to a fixed window; when texts are longer than the window, it has to compensate by aggregating predictions about separate chunks (“voting” or averaging them). When I force my bag-of-words model to do the same thing, it loses some accuracy—so we can infer that BERT is also handicapped by the narrowness of its window.
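To make the aggregation step concrete, here is a minimal sketch of the two strategies mentioned above, averaging and voting. The function names and the list `chunk_probs` (one predicted probability of the positive class per chunk) are hypothetical, and the 0.5 threshold is an assumption that suits balanced classes.

```python
from statistics import mean


def aggregate_by_average(chunk_probs, threshold=0.5):
    """Label a document positive if the mean chunk probability clears the threshold."""
    return mean(chunk_probs) >= threshold


def aggregate_by_vote(chunk_probs, threshold=0.5):
    """Label a document positive if a strict majority of chunks are positive.

    Ties count as negative; that is an arbitrary choice for this sketch.
    """
    votes = [p >= threshold for p in chunk_probs]
    return sum(votes) > len(votes) / 2


# Hypothetical predictions for a review split into three chunks:
# one confidently positive chunk and two mildly negative ones.
chunk_probs = [0.9, 0.45, 0.4]
print(aggregate_by_average(chunk_probs))  # True
print(aggregate_by_vote(chunk_probs))     # False
```

The two rules can disagree, as they do here, because averaging lets one confident chunk outweigh two hesitant ones. Which rule works better is an empirical question for a given dataset.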

But for sentiment analysis, BERT’s strengths outweigh this handicap. When a review says that a movie is “less interesting than The Favourite,” a bag-of-words model will see “interesting!” and “favorite!” BERT, on the other hand, is capable of registering the negation.
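A small sketch shows why a bag-of-words model cannot register this kind of comparison. The two sentences below are invented for illustration, and `bag_of_words` is a hypothetical helper that simply lowercases and counts whitespace-separated words.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated words, discarding their order."""
    return Counter(text.lower().split())


unfavorable = bag_of_words("this movie is less interesting than The Favourite")
favorable = bag_of_words("The Favourite is less interesting than this movie")

# The two sentences express opposite opinions about "this movie",
# but their word counts are identical.
print(unfavorable == favorable)  # True
```

Any model that sees only these counts has to assign both sentences the same score; a model that reads word order does not.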

Okay, but this is a task well suited to BERT: modeling a boundary where syntax makes a big difference, in relatively short texts. How does BERT perform on problems more typical of recent work in cultural analytics—say, questions about genre in volume-sized documents?

The answer is that it struggles. It can sometimes equal, but rarely surpass, logistic regression on bags of words. Since I thought BERT would at least equal a bag-of-words model, I was puzzled by this result, and didn’t believe it until I saw the same code working very well on the sentiment-analysis task above.

The accuracy of models predicting genre. Boxplots reflect logistic regression on bags of words; we run 30 train/test/validation splits and plot the variation. For BERT, I ran a half-dozen models for each genre and plotted the best result. Small b is accuracy on individual chunks; capital B after aggregating predictions at volume level. All models use 250 volumes evenly drawn from positive and negative classes. BERT settings are usually 512 words / 2 epochs, except for the detective genre, which seemed to perform better at 256/1. More tuning might help there.

Why can’t BERT beat older methods of genre classification? I am not entirely sure yet. I don’t think BERT is simply bad at fiction, because its pre-training data includes a large corpus of books (BooksCorpus, alongside English Wikipedia), and Sims et al. get excellent results using BERT embeddings on fiction at paragraph scale. What I suspect is that models of genre require a different kind of representation—one that emphasizes subtle differences of proportion rather than questions of word sequence, and one that can be scaled up. BERT did much better on all genres when I shifted from 128-word segments to 256- and then 512-word lengths. Conversely, bag-of-words methods also suffer significantly when they’re forced to model genre in a short window: they lose more accuracy than they lost modeling movie reviews, even after aggregating multiple “votes” for each volume.

It seems that genre is expressed more diffusely than the opinions of a movie reviewer. If we chose a single paragraph randomly from a work of fiction, it wouldn’t necessarily be easy for human eyes to categorize it by genre. It is a lovely day in Hertfordshire, and Lady Cholmondeley has invited six guests to dinner. Is this a detective story or a novel of manners? It may remain hard to say for the first twenty pages. It gets easier after her nephew gags, turns purple and goes face-first into the soup course, but even then, we may get pages of apparent small talk in the middle of the book that could have come from a different genre. (Interestingly, BERT performed best on science fiction. This is speculative, but I tend to suspect it’s because the weirdness of SF is more legible locally, at the page level, than is the case for other genres.)

Although it may be legible locally in SF, genre is usually a question about a gestalt, and BERT isn’t designed to trace boundaries between 100,000-word gestalts. Our bag-of-words model may seem primitive, but it actually excels at tracing those boundaries. At the level of a whole book, subtle differences in the relative proportions of words can distinguish detective stories from realist novels with sordid criminal incidents, or from science fiction with noir elements.

I am dwelling on this point because the recent buzz around neural networks has revivified an old prejudice against bag-of-words methods. Dissolving sentences to count words individually doesn’t sound like the way human beings read. So when people are first introduced to this approach, their intuitive response is always to improve it by adding longer phrases, information about sentence structure, and so on. I initially thought that would help; computer scientists initially thought so; everyone does, initially. Researchers have spent the past thirty years trying to improve bags of words by throwing additional features into the bag (Bekkerman and Allan 2003). But these efforts rarely move the needle a great deal, and perhaps now we see why not.

BERT is very good at learning from word order—good enough to make a big difference for questions where word order actually matters. If BERT isn’t much help for classifying long documents, it may be time to conclude that word order just doesn’t cast much light on questions about theme and genre. Maybe genres take shape at a level of generality where it doesn’t really matter whether “Baroness poisoned nephew” or “nephew poisoned Baroness.”

I say “maybe” because this is just a blog post based on one week of tinkering. I tried varying the segment length, batch size, and number of epochs, but I haven’t yet tried the “large” or “cased” pre-trained models. It is also likely that BERT could improve if given further pre-training on fiction. Finally, to really figure out how much BERT can add to existing models of genre, we might try combining it in an ensemble with older methods. If you asked me to bet, though, I would bet that none of those stratagems will dramatically change the outlines of the picture sketched above. We have at this point a lot of evidence that genre classification is a basically different problem from paragraph-level NLP.

Anyway, to return to the question in the title of the post: based on what I have seen so far, I don’t expect Transformer models to displace other forms of text analysis. Transformers are clearly going to be important. They already excel at a wide range of paragraph-level tasks: answering questions about a short passage, recognizing logical relations between sentences, predicting which sentence comes next. Those strengths will matter for classification boundaries where syntax matters (like sentiment). More importantly, they could open up entirely new avenues of research: Sims et al. have been using BERT embeddings for event detection, for instance—implying a new angle of attack on plot.

But volume-scale questions about theme and genre appear to represent a different sort of modeling challenge. I don’t see much evidence that BERT will help there; simpler methods are actually tailored to the nature of this task with a precision we ought to appreciate.

Finally, if you’re on the fence about exploring this topic, it might be shrewd to wait a year or two. I don’t believe Transformer models have to be hard to use; they are hard right now, I suspect, mostly because the technology isn’t mature yet. So you may run into funky issues about dependencies, GPU compatibility, and so on. I would expect some of those kinks to get worked out over time; maybe eventually this will become as easy as “from sklearn import bert”?


Bekkerman, Ron, and James Allan. “Using Bigrams in Text Categorization.” 2003.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” 2018.

HuggingFace. “PyTorch Pretrained BERT: The Big and Extending Repository of Pretrained Transformers.”

Maas, Andrew, et al. “Learning Word Vectors for Sentiment Analysis.” 2011.

Rajapakse, Thilina. “A Simple Guide to Using BERT for Binary Text Classification.” 2019.

Sims, Matthew, Jong Ho Park, and David Bamman. “Literary Event Detection.” 2019.

Underwood, Ted. “The Life Cycles of Genres.” The Journal of Cultural Analytics. 2015.

Vaswani, Ashish, et al. “Attention Is All You Need.” 2017.









Do topic models warp time?

Recently, historians have been trying to understand cultural change by measuring the “distances” that separate texts, songs, or other cultural artifacts. Where distances are large, they infer that change has been rapid. There are many ways to define distance, but one common strategy begins by topic modeling the evidence. Each novel (or song, or political speech) can be represented as a distribution across topics in the model. Then researchers estimate the pace of change by measuring distances between topic distributions.

In 2015, Mauch et al. used this strategy to measure the pace of change in popular music—arguing, for instance, that changes linked to hip-hop were more dramatic than the British invasion. Last year, Barron et al. used a similar strategy to measure the influence of speakers in French Revolutionary debate.

I don’t think topic modeling causes problems in either of the papers I just mentioned. But these methods are so useful that they’re likely to be widely imitated, and I do want to warn interested people about a couple of pitfalls I’ve encountered along the road.

One reason for skepticism will immediately occur to humanists: are human perceptions about difference even roughly proportional to the “distances” between topic distributions? In one case study I examined, the answer turned out to be “yes,” but there are caveats attached. Read the paper if you’re curious.

In this blog post, I’ll explore a simpler and weirder problem. Unless we’re careful about the way we measure “distance,” topic models can warp time. Time may seem to pass more slowly toward the edges of a long topic model, and more rapidly toward its center.

For instance, suppose we want to understand the pace of change in fiction between 1885 and 1984. To make sure that there is exactly the same amount of evidence in each decade, we might randomly select 750 works in each decade, and reduce each work to 10,000 randomly sampled words. We topic-model this corpus. Now, suppose we measure change across every year in the timeline by calculating the average cosine distance between the two previous years and the next two years. So, for instance, we measure change across the year 1911 by taking each work published in 1909 or 1910, and comparing its topic proportions (individually) to every work published in 1912 or 1913. Then we’ll calculate the average of all those distances. The (real) results of this experiment are shown below.
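The measurement just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the experiment: the names `cosine_distance`, `pace_of_change`, and `docs_by_year` are hypothetical, and the two-topic proportions are toy values.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pace_of_change(docs_by_year, year):
    """Average the distances between every work from the two previous
    years and every work from the next two years."""
    before = docs_by_year.get(year - 2, []) + docs_by_year.get(year - 1, [])
    after = docs_by_year.get(year + 1, []) + docs_by_year.get(year + 2, [])
    if not before or not after:
        raise ValueError("need documents on both sides of the year")
    distances = [cosine_distance(u, v) for u, v in product(before, after)]
    return sum(distances) / len(distances)


# Toy corpus: each work is a distribution over two topics.
docs_by_year = {
    1909: [[1.0, 0.0]],
    1910: [[1.0, 0.0]],
    1912: [[0.0, 1.0]],
    1913: [[1.0, 0.0]],
}
change_1911 = pace_of_change(docs_by_year, 1911)
print(change_1911)  # 0.5
```

Note that the works are compared individually and the distances are averaged afterward; that ordering turns out to matter.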


Perhaps we’re excited to discover that the pace of change in fiction peaks around 1930, and declines later in the twentieth century. It fits a theory we have about modernism! Wanting to discover whether the decline continues all the way to the present, we add 25 years more evidence, and create a new topic model covering the century from 1910 to 2009. Then we measure change, once again, by measuring distances between topic distributions. Now we can plot the pace of change measured in two different models. Where they overlap, the two models are covering exactly the same works of fiction. The only difference is that one covers a century (1885-1984) centered at 1935, and the other a century (1910-2009) centered at 1960.


But the two models provide significantly different pictures of the period where they overlap. 1978, which was a period of relatively slow change in the first model, is now a peak of rapid change. On the other hand, 1920, which was a point of relatively rapid change, is now a trough of sluggishness.

Puzzled by this sort of evidence, I discussed this problem with Laure Thompson and David Mimno at Cornell, who suggested that I should run a whole series of models using a moving window on the same underlying evidence. So I slid a 100-year window across the two centuries from 1810 to 2009 in five 25-year steps. The results are shown below; I’ve smoothed the curves a little to make the pattern easier to perceive.
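For readers who want to check which spans those five models cover, here is a small sketch of the window arithmetic. The function name `sliding_windows` and its parameters are hypothetical; the sketch assumes each window includes both its first and last year.

```python
def sliding_windows(first_year, last_year, width, step):
    """List (start, end) spans of `width` years, inclusive of both ends,
    advancing by `step` years while the span still fits the timeline."""
    windows = []
    start = first_year
    while start + width - 1 <= last_year:
        windows.append((start, start + width - 1))
        start += step
    return windows


windows = sliding_windows(1810, 2009, width=100, step=25)
print(windows)
# [(1810, 1909), (1835, 1934), (1860, 1959), (1885, 1984), (1910, 2009)]
```

The last two spans are the two models compared earlier, centered at 1935 and 1960.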


The models don’t agree with each other well at all. You may also notice that all these curves are loosely n-shaped; they peak at the middle and decline toward the edges (although sometimes to an uneven extent). That’s why 1920 showed rapid change in a model centered at 1935, but became a trough of sloth in one centered at 1960. To make the pattern clearer we can directly superimpose all five models and plot them on an x-axis using date relative to the model’s timeline (instead of absolute date).


The pattern is clear: if you measure the pace of change by comparing documents individually, time is going to seem to move faster near the center of the model. I don’t entirely understand why this happens, but I suspect the problem is that topic diversity tends to be higher toward the center of a long timeline. When the modeling process is dividing topics, phenomena at the edges of the timeline may fall just below the threshold to form a distinct topic, because they’re more sparsely represented in the corpus (just by virtue of being near an edge). So phenomena at the center will tend to be described with finer resolution, and distances between pairs of documents will tend to be greater there. (In our conversation about the problem, David Mimno ran a generative simulation that produced loosely similar behavior.)

To confirm that this is the problem, I’ve also measured the average cosine distance, and Kullback-Leibler divergence, between pairs of documents in the same year. You get the same n-shaped pattern seen above. In other words, the problem has nothing to do with rates of change as such; it’s just that all distances tend to be larger toward the center of a topic model than at its edges. The pattern is less clearly n-shaped with KL divergence than with cosine distance, but I’ve seen some evidence that it distorts KL divergence as well.

But don’t panic. First, I doubt this is a problem with topic models that cover less than a decade or two. On a sufficiently short timeline, there may be no systematic difference between topics represented at the center and at the edges. Also, this pitfall is easy to avoid if we’re cautious about the way we measure distance. For instance, in the example above I measured cosine distance between individual pairs of documents across a 5-year period, and then averaged all the distances to create an “average pace of change.” Mathematically, that way of averaging things is slightly sketchy, for reasons Xanda Schofield explained on Twitter:


The mathematics of cosine distance tend to work better if you average the documents first, and then measure the cosine between the averages (or “centroids”). If you take that approach—producing yearly centroids and comparing the centroids—the five overlapping models actually agree with each other very well.


Calculating centroids factors out the n-shaped pattern governing average distances between individual books, and focuses on the (smaller) component of distance that is actually year-to-year change. Lines produced this way agree very closely, even about individual years where change seems to accelerate. As substantive literary history, I would take this evidence with a grain of salt: the corpus I’m using is small enough that the apparent peaks could well be produced by accidents of sampling. But the math itself is working.
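A toy example may help show why the order of averaging matters. In this sketch nothing changes between the two periods, but the books within each period differ from one another. All names here (`centroid`, `mean_pairwise_distance`, the two-topic vectors) are hypothetical, chosen only for illustration.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_pairwise_distance(before, after):
    """Compare documents individually, then average the distances."""
    distances = [cosine_distance(u, v) for u, v in product(before, after)]
    return sum(distances) / len(distances)


# Two periods containing exactly the same mix of books.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [0.0, 1.0]]

pairwise = mean_pairwise_distance(before, after)
between_centroids = cosine_distance(centroid(before), centroid(after))

print(pairwise)                     # 0.5
print(round(between_centroids, 6))  # 0.0
```

The pairwise average reports substantial "change" that is really just diversity among individual books, while the distance between centroids is zero up to floating-point error. If diversity is higher near the center of a model, the pairwise measure will inherit that curve; the centroid measure will not.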

I’m slightly more confident about the overall decline in the pace of change from the nineteenth century to the twenty-first. Although it doesn’t look huge on this graph, that pattern is statistically quite strong. But I would want to look harder before venturing a literary interpretation. For instance, is this pattern specific to fiction, or does it reflect a broadly shared deceleration in underlying rates of linguistic change? As I argued in a recent paper, supervised models may be better than raw distance measures at answering that culturally-specific question.

But I’m wandering from the topic of this post. The key observation I wanted to share is just that topic models produce a kind of curved space when applied to long timelines; if you’re measuring distances between individual topic distributions, it may not be safe to assume that your yardstick means the same thing at every point in time. This is not a reason for despair: there are lots of good ways to address the distortion. But it’s the kind of thing researchers will want to be aware of.



Remarks for a panel on data science in literary studies, in 2028

A transcript of these remarks was sent back via time machine to the Novel Theory conference at Ithaca in 2018, where a panel had been asked to envision what literary studies would look like if data analysis came to be understood as a normal part of the discipline.

I want to congratulate the organizers on their timeliness; 2028 is the perfect time for this retrospective. Even ten years ago I think few of us could have imagined that quantitative methods would become as uncontroversial in literary studies as they are today. (I, myself, might not have believed it.)

But the emergence of a data science option in the undergrad major changed a lot. It is true that the option only exists at a few schools, and most undergrads don’t opt for it. But it has made a few students confident enough to explore further. Today, almost 10% of dissertations in literary studies use numbers in some way.

A tenth of the field is not huge, but even that small foothold has had catalytic effects. One obvious effect has been to give literary studies a warm-water port in the social sciences. Without data science, I’m not sure we would be having the vigorous conversations we’re having today with sociologists, legal scholars, and historians of social media. Increasingly, our fields are bound together by a shared concern with large-scale textual interpretation. That shared conversation, in turn, invites journalists to give literary arguments a different kind of attention. There are examples everywhere. But since we’re all reading it, let me just point to today’s article in The Guardian on the Trump Prison Diaries—which couldn’t have been written ten years ago, for several obvious reasons.

But data analysis has also led to less obvious changes. I think even the recent, widely-discussed return to evaluative criticism—the so-called “new belletrism”—may have had something to do with data.

The conference venue, easily reached by water taxi.

I know this will seem unlikely. These are usually presented as opposing currents in the contemporary scene—one artsy, one mathy; one tending to focus on contemporary literature, the other on the longue durée. But I would argue that quantitative methods have made it easier to treat aesthetic arguments as scholarly questions.

It used to be difficult, after all, to reconcile evaluation with historicism. If you disavowed timeless aesthetic judgments, then it seemed you could only do historical reportage on the peculiar opinions of the 1820s or the 1920s.

Work on the history of reception has created an expansive middle ground between those poles—a way to study judgments that do change, but change in practice very slowly, across century-spanning arcs. Those arcs became visible when we backed up to a distance, but in a sense we can’t get historical distance from them; they sprawl into the present and can’t be relegated to the past. So historical scholars and contemporary critics are increasingly forced onto each other’s turf. Witness the fireworks lately between MFA programs and distant readers arguing that the recent history of the novel is really driven by genre fiction.

Most of these fireworks, I think, are healthy. But there have also been downsides. Ten years ago, none of us imagined that divisions within the data science community could become as deep or bitter as they are today. In a general sense data may be uncontroversial: many literary scholars use, say, a table of sales figures. But machine learning is more controversial than ever.

Ironically, it is often the people most enthusiastic about other forms of data who are rejecting machine learning. And I have to admit they have a point when they call it “subjective.”

It is notoriously true, after all, that learning algorithms absorb the biases implicit in the evidence you give them. So in choosing evidence for our models we are, in a sense, choosing a historical vantage point. Some people see this as appropriate for an interpretive discipline. “We have always known that interpretation was circular,” they say, “and it’s healthy to acknowledge that our inferences start from a situated perspective” (Rosencrantz, 2025). Other people worry that the subjectivity of machine learning is troubling, because it “hides inside a black box” (Guildenstern, 2027). I don’t think the debate is going away soon; I fear literary theorists will still be arguing about it when we meet again in 2038.


New methods need a new kind of conversation

Over the last decade, the (small) fraction of articles in the humanities that use numbers has slowly grown. This is happening partly because computational methods are becoming flexible enough to represent a wider range of humanistic evidence. We can model concepts and social practices, for instance, instead of just counting people and things.

That’s exciting, but flexibility also makes arguments complex and hard to review. Journal editors in the humanities may not have a long list of reviewers who can evaluate statistical models. So while quantitative articles certainly encounter some resistance, they don’t always get the kind of detailed resistance they need. I thought it might be useful to stir up conversation on this topic with a few suggestions, aimed less at the DH community than at the broader community of editors and reviewers in the humanities. I’ll start with proposals where I think there’s consensus, and get more opinionated as I go along.

1. Ask to see code and data.

Getting an informed reviewer is a great first step. But to be honest, there’s not a lot of consensus yet about many methodological questions in the humanities. What we need is less strict gatekeeping than transparent debate.

As computational methods spread in the sciences, scientists have realized that it’s impossible to discuss this work fruitfully if you can’t see how the work was done. Journals like Cultural Analytics reflect this emerging consensus with policies that require authors to share code and data. But mainstream humanities journals don’t usually have a policy in place yet.

Three or four years ago, confusion on this topic was understandable. But in 2018, journals that accept quantitative evidence at all need a policy that requires authors to share code and data when they submit an article for review, and to make it public when the article is published.

I don’t think the details of that policy matter deeply. There are lots of different ways to archive code and data; they are all okay. Special cases and quibbles can be accommodated. For instance, texts covered by copyright (or other forms of IP) need not be shared in their original form. Derived data can be shared instead; that’s usually fine. (Ideally one might also share the code used to derive it.)

2. … especially code.

Humanists are usually skeptical enough about the data underpinning an argument, because decades of debate about canons have trained us to pose questions about the works an author chooses to discuss.

But we haven’t been trained to pose questions about the magnitude of a pattern, or the degree of uncertainty surrounding it. These aspects of a mathematical argument often deserve more discussion than an author initially provides, and to discuss them, we’re going to need to see the code.

I don’t think we should expect code to be polished, or to run easily on any machine. Writing an article doesn’t commit the author to produce an elegant software tool. (In fact, to be blunt, “it’s okay for academic software to suck.”) The author just needs to document what they did, and the best way to do that is to share the code and data they actually used, warts and all.

3. Reproducibility is great, but replication is the real point.

Ideally, the code and data supporting an article should permit a reader to reproduce all the stages of analysis the author(s) originally performed. When this is true, we say the research is “reproducible.”

But there are often rough spots in reproducibility. Stochastic processes may not run exactly the same way each time, for instance.

At this point, people who study reproducibility professionally will crowd forward and offer an eleven-point plan for addressing all rough spots. (“You just set the random number seed so it’s predictable …”)

That’s wonderful, if we really want to polish a system that allows a reader to push a button and get the same result as the original researcher, to the seventh decimal place. But in the humanities, we’re not always at the “polishing” stage of inquiry yet. Often, our question is more like “could this conceivably work? and if so, would it matter?”

In short, I think we shouldn’t let the imperative to share code foster a premature perfectionism. Our ultimate goal is not to prove that you get exactly the same result as the author if you use exactly the same assumptions and the same books. It’s to decide whether the experiment is revealing anything meaningful about the human past. And to decide that, we probably want to repeat the author’s question using different assumptions and a different sample of books.

When we do that, we are not reproducing the argument but replicating it. (See Language Log for a fuller discussion of the difference.) Replication is the real prize in most cases; that’s how knowledge advances. So the point of sharing code and data is often less to stabilize the results of your own work to the seventh decimal place, and more to guide investigators who may want to undertake parallel inquiries. (For instance, Jonathan Goodwin borrowed some of my code to pose a parallel question about Darko Suvin’s model of science fiction.)

I admit this is personal opinion. But I stress replication over reproducibility because it has some implications for the spirit of the whole endeavor. Since people often imagine that quantitative problems have a right answer, we may initially imagine that the point of sharing code and data is simply to catch mistakes.

In my view the point is rather to permit a (mathematical) conversation about the interpretation of the human past. I hope authors and readers will understand themselves as delayed collaborators, working together to explore different options. What if we did X differently? What if we tried a different sample of books? Usually neither sample is wrong, and neither is right. The point is to understand how much different interpretive assumptions do or don’t change our conclusions. In a sense no single article can answer that question “correctly”; it’s a question that has to be solved collectively, by returning to questions and adjusting the way we frame them. The real point of code-sharing is to permit that kind of delayed collaboration.



A broader purpose

The weather prevents me from being there physically, but this is a transcript of my remarks for “Varieties of Digital Humanities,” MLA, Jan 5, 2018.

Using numbers to understand cultural history is often called “cultural analytics”—or sometimes, if we’re talking about literary history in particular, “distant reading.” The practice is older than either name: sociologists, linguists, and adventurous critics like Janice Radway have been using quantitative methods for a long time.

But over the last twenty years, numbers have begun to have a broader impact on literary study, because we’ve learned to use them in a wider range of ways. We no longer just count things that happen to be easily counted (individual words, for instance, or books sold). Instead scholars can start with literary questions that really interest readers, and find ways to model them. Recent projects have cast light, for instance, on the visual impact of poetry, on imagined geography in the novel, on the instability of gender, and on the global diffusion of stream of consciousness. Articles that use numbers are appearing in central disciplinary venues: MLQ, Critical Inquiry, PMLA. Equally important: a new journal called Cultural Analytics has set high standards for transparent and reproducible research.

Of course, scholars still disagree with each other. And that’s part of what makes this field exciting. We aren’t simply piling up facts. New methods are sparking debate about the nature of the knowledge literary historians aim to produce. Are we interpreting the past or explaining it? Can numbers address perspectival questions? The name for these debates is “critical theory.” Twenty years from now, I think it will be clear that questions about quantitative models form an important unit in undergraduate theory courses.

Literary scholars are used to imagining numbers as tools, not as theories. So there’s translation work to be done. But translating between theoretical traditions could be the most important part of this project. Our existing tradition of critical theory teaches students to ask indispensable questions—about power, for instance, and the material basis of ideology. But persuasive answers to those questions will often require a lot of evidence, and the art of extracting meaningful patterns from evidence is taught by a different theoretical tradition, called “statistics.” Students will be best prepared for the twenty-first century if they can connect the two traditions, and do critical theory with numbers.

So in a lot of ways, this is a heady moment. Cultural analytics has historical discoveries, lively theoretical debates, and a public educational purpose. Intellectually, we’re in good shape.

But institutionally, we’re in awful shape. Or to be blunt: we are shape-less. Most literature departments do not teach students how to do this stuff at all. Everything I’ve just discussed may be represented by one unit in one course, where students play with topic models. Reduced to that size, I’m not sure cultural analytics makes any sense. If we were seriously trying to teach students to do critical theory with numbers, we would need to create a sequence of courses that guides them through basic principles (of statistical inference as well as historical interpretation) toward projects where they can pose real questions about the past.

What keeps us from building that curriculum? Part of the obstacle, I think, is the term digital humanities itself. Don’t get me wrong: I’m grateful for the popularity of DH. It has lent energy to many different projects. But the term digital humanities has been popular precisely because it promises that all those projects can still be contained in the humanities. The implicit pitch is something like this: “You won’t need a whole statistics course. Come to our two-hour workshop on topic models instead. You can always find a statistician to collaborate with.”

I understand why digital humanists said that kind of thing eight years ago. We didn’t want to frighten people away. If you write “Learn Another Discipline” on your welcome mat, you may not get many visitors. But a deceptively gentle welcome mat, followed by a trapdoor, is not really more welcoming. So it’s time to be honest about the preparation needed for cultural analytics. Young people entering this field will need to understand the whole process. They won’t even be able to pose meaningful questions, for instance, without some statistics.

Trompe l’oeil faux door mural from http://www.bumblebee

But the metaphor of a welcome mat may be too optimistic. This field doesn’t have a door yet. I mean, there is no curriculum. So of course the field tends to attract people who already have an extracurricular background—which, of course, is not equally distributed. It shouldn’t surprise us that access is a problem when this field only exists as a social network. The point of a classroom is to distribute knowledge in a more equal, less homosocial way. But digital humanities classes, as currently defined, don’t really teach students how to use numbers. (For a bracingly honest exploration of the problem, see Andrew Goldstone.) So it’s almost naive to discuss “barriers to entry.” There is no entrance to this field. What we have is more like a door painted on the wall. But we’re in denial about that—because to admit the problem, we would have to admit that “DH” isn’t working as a gateway to everything it claims to contain.

I think the courses that can really open doors to cultural analytics are found, right now, in the social sciences. That’s why I recently moved half of my teaching to a School of Information Sciences. There, you find a curricular path that covers statistics and programming along with social questions about technology. I don’t think it’s an accident that you also find better gender and ethnic diversity among people using numbers in the social sciences. Methods get distributed more equally within a discipline that actually teaches the methods. So I recommend fusing cultural analytics with social science partly because it immediately makes this field more diverse. I’m not offering that as a sufficient answer to problems of access. I welcome other answers too. But I am suggesting that social-scientific methods are a necessary part of access. We cannot lower barriers to entry by continuing to pretend that cultural analytics is just the humanities, plus some user-friendly digital tools. That amounts to a trompe-l’oeil door.

What the social sciences lack are courses in literary history. And that’s important, because distant readers set out to answer concrete historical questions. So the unfortunate reality is, this project cannot be contained in one discipline.  The questions we try to answer are taught in the humanities. But the methods we use are taught, right now, in the social sciences and data science. Even if it frightens some students off, we have to acknowledge that cultural analytics is a multi-disciplinary project—a bridge between the humanities and quantitative social science, belonging equally to both.

I’m not recommending this approach for the DH community as a whole. DH has succeeded by fitting into the institutional framework of the humanities. DH courses are often pitched to English or History majors, and for many topics, that works brilliantly. But it’s awkward for quantitative courses. To use numbers wisely, students need preparation that an English major doesn’t provide. So increasingly I see the quantitative parts of DH presented as an interdisciplinary program rather than a concentration in the humanities.

In saying this, I don’t mean to undersell the value of numbers for humanists. New methods can profoundly transform our view of the human past, and the research is deeply rewarding. So I’m convinced that statistics, and even machine learning, will gradually acquire a place in the humanistic curriculum.

I’m just saying that this is a bigger, slower project than the rhetoric of DH may have led us to expect. Mathematics doesn’t really come packaged in digital tools. Math is a way of thinking, and using it means entering into a long-term relationship with statisticians and social scientists. We are not borrowing tools for private use inside our discipline, but starting a theoretical conversation that should turn us outward, toward new forms of engagement with our colleagues and the world.

What is the point of studying culture with numbers, after all? It’s not to change English departments, but to enrich the way all students think about culture. The questions we’re posing can have real implications for the way students understand their roles in history—for instance, by linking their own cultural experience to century-spanning trends. Even more urgently, these questions give students a way to connect interpretive insights and resonant human details with habits of experimental inquiry.

Instead of imagining cultural analytics as a subfield of DH, I would almost call it an emerging way to integrate the different aspects of a liberal education. People who want to tackle that challenge are going to have to work across departments to some extent: it’s not a project that an English department could contain. But it is nevertheless an important opportunity for literary scholars, since it’s a place where our work becomes central to the broader purposes of the university as a whole.

disciplinary history

It looks like you’re writing an argument against data in literary study …

would you like some help with that?

I’m not being snarky. Right now, I have several friends writing articles that are largely or partly a critique of interrelated trends that go under the names “data” or “distant reading.” It looks like many other articles of the same kind are being written.

This is good news! I believe fervently in Mae West’s theory of publicity. “I don’t care what the newspapers say about me as long as they spell my name right.” (Though it turns out we may not actually know who said that, so I guess the newspapers failed.)

In any case, this blog post is not going to try to stop you from proving that numbers are neoliberal, unethical, inevitably assert objectivity, aim to eliminate all close reading from literary study, fail to represent time, and lead to loss of “cultural authority.” Go for it! Ideas live on critique.

But I do want to help you “spell our names right.” Andrew Piper has recently pointed out that critiques of data-driven research tend to use a small sample of articles. He expressed that more strongly, but I happen to like the article he was aiming at, so I’m going to soften his expression. However, I don’t disagree with the underlying point! For some reason, critics of numbers don’t feel they need to consider more than one example, or two if they’re in a generous mood.

There are some admirable exceptions to this rule. I’ve argued that a recent issue of Genre was, in general, moving in the right direction. And I’m fairly confident that the trend will continue. A field that has been generating mostly articles and pamphlets is about to shift into a lower gear and publish several books. In literary studies, that tends to be an effective way of reframing debate.

But it may be another twelve to eighteen months before those books are out. In the meantime, you’ve got to finish your critique. So let me help with the bibliography.

When you’re tempted to assume that all possible uses of numbers (or “data”) in literary study can be summed up by engaging one or two texts that Franco Moretti wrote in the year 2000, you should resist the assumption. You are actually talking about a long, complex story, and your readers deserve some glimpse of its complexity.

For instance, sociologists, linguists and book historians have been using numbers to describe literature since the middle of the twentieth century. You should make clear whether you are critiquing that work, or just arguing that it is incapable of addressing the inner literariness of literature. The journal Computers and the Humanities started in the 1960s. The 1980s gave rise to a thriving tradition of feminist literary sociology, embodied in books by Janice Radway and Gaye Tuchman, and in the journal Signs. I’ve used one of Tuchman’s regression models as an illustration here.

Variables predicting literary fame in a regression model, from Gaye Tuchman and Nina E. Fortin, Edging Women Out (1989).

<deep breath>

In the 1990s, Mark Olsen (working at the University of Chicago) started to articulate many of the impulses we now call “distant reading.” Around 2000, Franco Moretti gave quantitative approaches an infusion of polemical verve and wit, which raised their profile among literary scholars who had not previously paid attention. (Also, frankly, the fact that Moretti already had disciplinary authority to spend mattered a great deal. Literary scholars can be temperamentally conservative even when theoretically radical.)

But Moretti himself is a moving target. The articles he has written since 2008 aim at different goals, and use different methods, than articles before that date. Part of the point of an experimental method, after all, is that you are forced to revise your assumptions! Because we are actually learning things, this field is changing rapidly. A recent pamphlet from the Stanford Literary Lab conceives the role of the “archive,” for instance, very differently than “Slaughterhouse of Literature” did.

But that pamphlet was written by six authors—a useful reminder that this is a collective project. Today the phrase “distant reading” is often a loose description for large-scale literary history, covering many people who disagree significantly with Moretti. In a recent roundtable in PMLA, for instance, Andrew Goldstone argues for evidence of a more sociological and less linguistic kind. Lisa Rhody and Alison Booth both argue for different scales or forms of “distance.” Richard Jean So argues that the simple measurements which typified much work before 2010 need to be replaced by statistical models, which account for variation and uncertainty in a more principled way.

One might also point, for instance, to Lauren Klein’s work on gaps in the archive, or to Ryan Cordell’s work on literary circulation, or to Katherine Bode’s work, which aims to construct corpora that represent literary circulation rather than production. Or to Matt Wilkens, or Hoyt Long, or Tanya Clement, or Matt Jockers, or James F. English … I’m going to run out of breath before I run out of examples.

Not all of these scholars believe that numbers will put literary scholarship on a more objective footing. Few of them believe that numbers can replace “interpretation” with “explanation.” None of them, as far as I can tell, have stopped doing close reading. (I would even claim to pair numbers with close reading in Joseph North’s strong sense of the phrase: not just reading-to-illustrate-a-point but reading-for-aesthetic-cultivation.) In short, the work literary scholars are doing with numbers is not easily unified by a shared set of principles—and definitely isn’t unified by a 17-year-old polemic. The field is unified, rather, by a fast-moving theoretical debate. Literary production versus circulation. Book history versus social science. Sociology versus linguistics. Measurement versus modeling. Interpretation versus explanation versus prediction.

Critics of this work may want to argue that it all nevertheless fails in the same way, because numbers inevitably (flatten time/reduce reading to visualization/exclude subjectivity/fill in the blank). That’s a fair thesis to pursue. But if you believe that, you need to show that your generalization is true by considering several different (recent!) examples, and teasing out the tacit similarities concealed underneath ostensible disagreements. I hope this post has helped with some of the bibliographic legwork. If you want more sources, I recently wrote a “Genealogy of Distant Reading” that will provide more. Now, tear them apart!

disciplinary history interpretive theory

We’re probably due for another discussion of Stanley Fish

I think I see an interesting theoretical debate over the horizon. The debate is too big to resolve in a blog post, but I thought it might be narratively useful to foreshadow it—sort of as novelists create suspense by dropping hints about the character traits that will develop into conflict by the end of the book.

Basically, the problem is that scholars who use numbers to understand literary history have moved on from Stanley Fish’s critique, without much agreement about why or how. In the early 1970s, Fish gave a talk at the English Institute that defined a crucial problem for linguistic analysis of literature. Later published as “What Is Stylistics, and Why are They Saying Such Terrible Things About It?”, the essay focused on “the absence of any constraint” governing the move “from description to interpretation.” Fish takes Louis Milic’s discussion of Jonathan Swift’s “habit of piling up words in series” as an example. Having demonstrated that Swift does this, Milic concludes that the habit “argues a fertile and well stocked mind.” But Fish asks how we can make that sort of inference, generally, about any linguistic pattern. How do we know that reliance on series demonstrates a “well stocked mind” rather than, say, “an anal-retentive personality”?

The problem is that isolating linguistic details for analysis also removes them from the context we normally use to give them a literary interpretation. We know what the exclamation “Sad!” implies, when we see it at the end of a Trumpian tweet. But if you tell me abstractly that writer A used “sad” more than writer B, I can’t necessarily tell you what it implies about either writer. If I try to find an answer by squinting at word lists, I’ll often make up something arbitrary. Word lists aren’t self-interpreting.

Thirty years passed; the internet got invented. In the excitement, dusty critiques from the 1970s got buried. But Fish’s argument was never actually killed, and if you listen to the squeaks of bats, you hear rumors that it still walks at night.

Or you could listen to blogs. This post is partly prompted by a blogged excerpt from a forthcoming work by Dennis Tenen, which quotes Fish to warn contemporary digital humanists that “a relation can always be found between any number of low-level, formal features of a text and a given high-level account of its meaning.” Without “explanatory frameworks,” we won’t know which of those relations are meaningful.

Ryan Cordell’s recent reflections on “machine objectivity” could lead us in a similar direction. At least they lead me in that direction, because I think the error Cordell discusses—over-reliance on machines themselves to ground analysis—often comes from a misguided attempt to solve the problem of arbitrariness exposed by Fish. Researchers are attracted to unsupervised methods like topic modeling in part because those methods seem to generate analytic categories that are entirely untainted by arbitrary human choices. But as Fish explained, you can’t escape making choices. (Should I label this topic “sadness” or “Presidential put-downs”?)

I don’t think any of these dilemmas are unresolvable. Although Fish’s critique identified a real problem, there are lots of valid solutions to it, and today I think most published research is solving the problem reasonably well. But how? Did something happen since the 1970s that made a difference? There are different opinions here, and the issues at stake are complex enough that it could take decades of conversation to work through them. Here I just want to sketch a few directions the conversation could go.

Dennis Tenen’s recent post implies that the underlying problem is that our models of form lack causal, explanatory force. “We must not mistake mere extrapolation for an account of deep causes and effects.” I don’t think he takes this conclusion quite to the point of arguing that predictive models should be avoided, but he definitely wants to recommend that mere prediction should be supplemented by explanatory inference. And to that extent, I agree—although, as I’ll say in a moment, I have a different diagnosis of the underlying problem.

It may also be worth reviewing Fish’s solution to his own dilemma in “What Is Stylistics,” which was that interpretive arguments need to be anchored in specific “interpretive acts” (93). That has always been a good idea. David Robinson’s analysis of Trump tweets identifies certain words (“badly,” “crazy”) as signs that a tweet was written by Trump, and others (“tomorrow,” “join”) as signs that it was written by his staff. But he also quotes whole tweets, so you can see how words are used in context, make your own interpretive judgment, and come to a better understanding of the model. There are many similar gestures in Stanford LitLab pamphlets: distant readers actually rely quite heavily on close reading.
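The move described above, from a list of distinctive words back to whole texts, can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of Robinson's analysis: the two groups of texts are invented, and `log_odds` and `in_context` are hypothetical helper names. The sketch scores each word with a smoothed log-odds ratio, then retrieves the texts where a word occurs so that a reader can judge it in context.

```python
import math
from collections import Counter

# Invented toy texts standing in for two groups of authors (not real tweets).
group_a = ["the deal was handled badly", "crazy and badly run", "a crazy deal"]
group_b = ["join us tomorrow", "join the rally tomorrow night", "see you tomorrow"]

def counts(texts):
    """Count every whitespace-separated word across a list of texts."""
    return Counter(word for t in texts for word in t.lower().split())

def log_odds(word, a, b, vocab_size):
    """Smoothed log-odds: positive means 'more typical of group A'."""
    # Add-one smoothing keeps words absent from one group from dividing by zero.
    pa = (a[word] + 1) / (sum(a.values()) + vocab_size)
    pb = (b[word] + 1) / (sum(b.values()) + vocab_size)
    return math.log(pa / pb)

def in_context(word, texts):
    """Return the whole texts in which a word appears."""
    return [t for t in texts if word in t.lower().split()]

a, b = counts(group_a), counts(group_b)
vocab = set(a) | set(b)
scores = {w: log_odds(w, a, b, len(vocab)) for w in vocab}
ranked = sorted(scores, key=scores.get)  # most B-like first, most A-like last

print("most typical of group B:", ranked[:3])
print("most typical of group A:", ranked[-3:])
print("'crazy' in context:", in_context("crazy", group_a))
```

The ranked list alone is exactly the kind of word list that isn't self-interpreting. The last line is the interpretive anchor: it puts the word back into the sentences where it was used.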

My understanding of this problem has been shaped by a slightly later Fish essay, “Interpreting the Variorum” (1976), which returns to the problem broached in “What Is Stylistics,” but resolves it in a more social way. Fish concludes that interpretation is anchored not just in an individual reader’s acts of interpretation, but in “interpretive communities.” Here, I suspect, he is rediscovering an older hermeneutic insight, which is that human acts acquire meaning from the context of human history itself. So the interpretation of culture inevitably has a circular character.

One lesson I draw is simply that we shouldn’t work too hard to avoid making assumptions. Most of the time we do a decent job of connecting meaning to an implicit or explicit interpretive community. Pointing to examples, using word lists derived from a historical thesaurus or sentiment dictionary: all of that can work well enough. The really dubious moves we make often come from trying to escape circularity altogether, in order to achieve what Alan Liu has called “tabula rasa interpretation.”

But we can also make quantitative methods more explicit about their grounding in interpretive communities. Lauren Klein’s discussion of the TOME interface she constructed with Jacob Eisenstein is a good model here; Klein suggests that we can understand topic modeling better by dividing a corpus into subsets of documents (say, articles from different newspapers), to see how a topic varies across human contexts.

Of course, if you pursue that approach systematically enough, it will lead you away from topic modeling toward methods that rely more explicitly on human judgment. I have been leaning on supervised algorithms a lot lately—not because they’re easier to test or more reliable than unsupervised ones—but because they explicitly acknowledge that interpretation has to be anchored in human history.

At first glance, this may seem to make progress impossible. “All we can ever discover is which books resemble these other books selected by a particular group of readers. The algorithm can only reproduce a category someone else already defined!” And yes, supervised modeling is circular. But this is a circularity shared by all interpretation of history, and it never merely reproduces its starting point. You can discover that books resemble each other to different degrees. You can discover that models defined by the responses of one interpretive community do or don’t align with models of another. And often you can, carefully, provisionally, draw explanatory inferences from the model itself, assisted perhaps by a bit of close reading.
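To make the point about graded resemblance concrete, here is a minimal sketch using a tiny multinomial naive Bayes classifier written from scratch. Everything in it is an assumption made for illustration: the "books" are invented strings of keywords, the two sets of labels stand in for the selections of two hypothetical interpretive communities, and `train` and `prob_selected` are made-up names. Real work of this kind would use full texts and a tested library.

```python
import math
from collections import Counter

def train(texts, labels):
    """Fit multinomial naive Bayes; labels are 1 (selected) or 0 (not).

    Assumes both classes occur at least once in `labels`.
    """
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    prior = sum(labels) / len(labels)
    return counts, vocab, prior

def prob_selected(model, text):
    """Probability, between 0 and 1, that a text resembles the selected group."""
    counts, vocab, prior = model
    log_score = {0: math.log(1 - prior), 1: math.log(prior)}
    for label in (0, 1):
        total = sum(counts[label].values()) + len(vocab)  # add-one smoothing
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                log_score[label] += math.log((counts[label][word] + 1) / total)
    return 1 / (1 + math.exp(log_score[0] - log_score[1]))

# Invented toy "books", reduced to a handful of keywords each.
train_books = [
    "storm sea ship captain",
    "ship island sea treasure",
    "parlor letter marriage estate",
    "marriage sister letter ball",
]
labels_a = [1, 1, 0, 0]  # hypothetical community A selects the sea stories
labels_b = [1, 0, 0, 1]  # hypothetical community B selects a different pair

new_books = ["sea ship voyage", "sea letter", "marriage ball"]
model_a = train(train_books, labels_a)
model_b = train(train_books, labels_b)
scores_a = [prob_selected(model_a, book) for book in new_books]
scores_b = [prob_selected(model_b, book) for book in new_books]

for book, pa, pb in zip(new_books, scores_a, scores_b):
    print(f"{book:18} A={pa:.2f}  B={pb:.2f}")
```

Model A does not simply hand back its training categories: it places the three new books at about 0.90, 0.50, and 0.14, a matter of degree. Model B, trained on the same books with different selections, ranks them differently, which is the sense in which two communities' models can fail to align.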

I’m not trying to diss unsupervised methods here. Actually, unsupervised methods are based on clear, principled assumptions. And a topic model is already a lot more contextually grounded than “use of series == well stocked mind.” I’m just saying that the hermeneutic circle is a little slipperier in unsupervised learning, easier to misunderstand, and harder to defend to crowds of pitchfork-wielding skeptics.

In short, there are lots of good responses to Fish’s critique. But if that critique is going to be revived by skeptics over the next few years—as I suspect—I think I’ll take my stand for the moment on supervised machine learning, which can explicitly build bridges between details of literary language and social contexts of reception.  There are other ways to describe best practices: we could emphasize a need to seek “explanations,” or avoid claims of “objectivity.” But I think the crucial advance we have made over the 1970s is that we’re no longer just modeling language; we can model interpretive communities at the same time.

Photo credit: A school of yellow-tailed goatfish, photo for NOAA Photo Library, CC-BY Dwayne Meadows, 2004.

Postscript July 15: Jonathan Armoza points out that Stephen Ramsay wrote a post articulating his own, more deformative response to “What Is Stylistics” in 2012.