About tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

How predictable is fiction?

This blog post is loosely connected to a talk I’m giving (virtually) at the Workshop on Narrative Understanding, Storylines, and Events at the ACL. It’s an informal talk, exploring some of the challenges and opportunities we encounter when we take the impressive sentence-level tools of contemporary NLP and try to use them to produce insights about book-length documents.

Questions about the “predictability” of fiction started to interest me after I read a preprint by Maarten Sap et al. on the difference between “recollected” and “imagined” stories. There’s a lot in the paper, but the thing that especially caught my eye was that a neural language model (GPT) does better predicting the next sentence in imagined stories than in recollected stories about biographical events. The authors persuasively interpret this as a sign that imagined stories have been streamlined by a process of “narrativization.”

The stories in that article are very short narratives made up (or recalled) by experimental subjects. But, given my background in literary history, I wondered whether the same contrast might appear between book-length works of fiction and biography. Are fictional narratives in some sense more predictable than nonfiction?

One could say we already know the answer. Fiction is governed by plot conventions, so of course it makes sense that it’s predictable! But an equally intuitive argument could be made that fiction entertains readers by baffling and eluding their expectations about what, specifically, will happen next. Perhaps it ought to be less predictable than nonfiction? In short, there are basic questions about fiction that don’t have clear general answers yet, although we’re getting better at framing the questions. (See e.g. Caroline Levine on The Serious Pleasures of Suspense, Vera Tobin on Elements of Surprise, or Andrew Piper’s chapters on “Plot” and “Fictionality” in Enumerations.)

Plus, even if it were intuitively obvious that fiction is more strongly governed by plot conventions than by surprise, it might be interesting to measure the strength of those conventions in particular works. If we could do that, we’d have new evidence for a host of familiar debates about tradition and innovation.

So, how to do it? Sap et al. measure “narrative flow” by using a neural language model that can judge whether a sentence is likely to occur in a given context. It’s a good strategy for paragraph or page-sized stories, but I suspect sentences may be too small to capture the things we would call “predictable plot patterns” in novels. However, it wasn’t hard to give this strategy a spin, so I did, using a language model called BERT to assess pairs of sentences from 32 biographies and 32 novels. (This is just a toy-sized sample for a semi-thought-experiment; I’m not pretending to finally resolve anything.) At each step, in each book, I asked BERT to judge the probability that sentence B would really follow sentence A. (The code I used is in a GitHub repo.)
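For concreteness, the core of that measurement looks roughly like the sketch below. This is not the exact code in my repo; it assumes the current Hugging Face transformers package and uses BERT’s next-sentence-prediction head to score a pair of sentences.

```python
# A sketch of the basic measurement: asking BERT how plausible it is
# that sentence B follows sentence A. (Not the exact code in my repo.)
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def continuation_probability(sentence_a, sentence_b):
    """Probability, per BERT's next-sentence head, that sentence_b follows sentence_a."""
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # index 0 = "B really continues A"; index 1 = "B is a random sentence"
    return torch.softmax(logits, dim=1)[0, 0].item()

print(continuation_probability(
    "Mr Bennet was among the earliest of those who waited on Mr Bingley.",
    "He had always intended to visit him."))
```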

The result I got was the opposite of the one reported in Sap et al. There is a statistically significant difference between biography and fiction, but the pairs of sentences in biography appeared more predictable—more likely to follow each other—than the sentences in fiction. I hasten to say, however, that this could be wrong in several ways. First, BERT’s perception that two sentences are likely to follow each other correlates strongly with the length of the sentences. Short sentences (like most sentences in dialogue) seem less clearly connected. Since there’s a lot of dialogue in published fiction, BERT might be, in effect, biased against fiction.

Fig. 1. Two different ways of measuring continuity between some sample sentences.

More importantly, sentence-level continuity isn’t necessarily a good measure of surprise in novel-length works. For instance, in fig. 1, you’ll notice that BERT is unruffled when Pride and Prejudice morphs into Flatland. As long as each sentence picks up some discursive cue from the one before, BERT perceives the pairs as plausibly connected. But by the fourth sentence in the chain, Mr Bennet is listening to a lecture from a translucent, blue, four-dimensional being in his sitting room. Human readers would probably be surprised if this happened.

There are ways to generate “sentence embeddings” that might correspond more closely to human surprise. (This is a crowded field, but see for instance Sentence-BERT, Reimers and Gurevych 2019.) Even primitive 2014-era GloVe embeddings do a somewhat better job (Pennington, Socher, and Manning 2014). By averaging the GloVe embeddings for all the words in a sentence, we can represent each sentence as a vector of length 300. Then we can measure the cosine distances between sentences, as I’ve done in the third column of Fig 1. (Here, large numbers indicate a big gap between sentences; it’s the reverse of the “probability” measure provided by BERT, where high numbers represent continuity.) This model of distance is (appropriately) more surprised by the humming blue sphere in row three than by the short sentence of dialogue in row five.
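Concretely, the measure in the third column of fig. 1 can be sketched like this. I am assuming you have downloaded the plain-text glove.6B.300d.txt vectors; my actual preprocessing and tokenization may differ in small ways.

```python
# Represent each sentence as the average of the GloVe vectors for its words,
# then measure cosine distance between sentence vectors.
import numpy as np

def load_glove(path):
    """Read plain-text GloVe vectors into a dict mapping word -> 300-d array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def sentence_vector(sentence, vectors, dim=300):
    """Average the vectors for all words in the sentence that GloVe knows."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_distance(a, b):
    """1 minus cosine similarity: larger numbers mean a bigger gap between sentences."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

glove = load_glove("glove.6B.300d.txt")
gap = cosine_distance(sentence_vector("What an excellent father you have, girls.", glove),
                      sentence_vector("The sphere hummed softly as it spoke.", glove))
print(gap)
```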

But even if we had a good measure of continuity, sentences might just be too small to capture the patterns that count as “predictability” in a novel. As the example in fig. 1 suggests, a sequence of short steps, individually unsurprising, can leave the reader in a world very different from the place they started. Continuity of this kind is not the “predictability” we would want to measure at book scale.

When readers talk about predictable or unpredictable stories, they’re probably thinking about specific problem situations and possible outcomes. Will the protagonist marry suitor A or suitor B? Can we guess? It may soon be possible to automatically extract implicit questions of this kind from fiction. And the Story Cloze task (Mostafazadeh et al.) showed that it’s possible to answer “what happens next” at paragraph scale. But right now I don’t know how to extract implicit questions, or answer them, at the scale of a novel. So let’s try a simpler—in fact minimal—predictive task. Given two passages selected at random from a book, can we predict which came first? Doing that won’t tell us anything about plot—if “plot” is a causal connection between events. But it will tell us whether book-length works are organized by any predictable large-scale patterns. (As we’ll see in a moment, this is a real question, and in some genres the answer might be “not really.”)

The vector-space representation we developed in the third column of Fig. 1 can be scaled up for this question. “Paragraphs” and “chapters” mean different things in different periods, so for now, it may be better simply to divide stories into arbitrary thousand-word passages. Each passage will be represented as a vector by averaging the GloVe embeddings for the words in it; we’ll subtract one passage from the other and use the difference to decide whether A came before B in the book, or vice-versa.
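In sketch form, the setup looks like this. The sampling and regularization details in my repo may differ; the C value below is a placeholder.

```python
# Build (difference-vector, label) pairs for the "which passage came first?"
# task, then fit a regularized logistic regression on pairs pooled from many books.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def order_pairs(passage_vectors, n_pairs=50):
    """passage_vectors: list of averaged-GloVe vectors for 1000-word passages,
    in narrative order. Returns difference vectors and labels (1 = true order)."""
    X, y = [], []
    n = len(passage_vectors)
    for _ in range(n_pairs):
        i, j = sorted(rng.choice(n, size=2, replace=False))
        if rng.random() < 0.5:
            X.append(passage_vectors[i] - passage_vectors[j])
            y.append(1)        # earlier passage minus later passage
        else:
            X.append(passage_vectors[j] - passage_vectors[i])
            y.append(0)        # later passage minus earlier passage
    return X, y

# Training pools pairs from 47 volumes and tests on the held-out 48th, roughly:
# model = LogisticRegression(C=0.1, max_iter=1000).fit(X_train, y_train)
# accuracy = model.score(X_test, y_test)
```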

Fig. 2. Accuracy of sequence prediction for randomly selected pairs of passages from detective novels, or novels randomly selected from the whole Chicago Novel Corpus. Regularized logistic regression is trained on 47 volumes and tested on the 48th; the boxplots represent the range of mean accuracies for different held-out volumes.

Random accuracy for this task would be 50%, but a model trained on a reasonable number of novels can easily achieve 65-66%, especially if the novels are all in the same genre. That number may not sound impressive, but I suspect it’s not much worse than human accuracy would be—if a human reader were asked to draw the arrow of time connecting two random passages from an unfamiliar book.

In fact, why is it possible to do this at all? Since the two passages may be separated by a hundred-odd pages, our model clearly isn’t registering any logical relationships between events. Instead, it’s probably relying on patterns described in previous work by David McClure and Scott Enderle. McClure and Enderle have shown that there are strong linguistic gradients across narrative time in fiction. References to witnesses, guilt, and jail, for instance, tend to occur toward the end of a book (if they occur at all).

Fig. 3. David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Our model may draw even stronger clues from simple shifts of rhetorical perspective like the one in figure 4: indefinite articles appear early in a book, when “a mysterious old man” enters “a room.” A few pages later, he will either acquire a name or become “the old man” in “the room.”

Fig. 4. David William McClure and Scott Enderle, “Distribution of Function Words Across Narrative Time in 50,000 Novels,” ADHO 2018.

We probably wouldn’t call that shift of perspective “plot.” On the other hand, before we dismiss these gradients as merely linguistic rather than narrative phenomena, it’s worth noting that they seem to be specific to fiction. When I try to use the same general strategy to predict the direction of time between pairs of passages in biographies, the model struggles to do better than random guessing. Even with the small toy sample I’m using below (32 novels and 32 biographies), there is clearly a significant difference between the two genres. So, although BERT may not see it, fictional narratives are more predictable than nonfiction ones when we back out to look at the gradient of time across a whole book. There is a much clearer difference between before and after in fiction.

Fig. 5. Range of accuracies for a regularized logistic regression model trained to identify the earlier of two 1000-word passages.

“A predictable difference between before and after” is a good bit cruder than what we ordinarily mean by “plot.” But the fact that this difference is specific to fiction makes me think that a model of this kind may, after all, confirm some part of what we mean when we speculate that “fictional plots are shaped by conventions that make them more predictable than nonfiction.”

Of course, to really understand plot, we will need to pair these loose book-sized arcs with a more detailed understanding of the way characters’ actions are connected as we move from one page to the next. For that kind of work, I invite you to survey the actual papers accepted for the Workshop on Narrative Understanding <gestures at the program>, which are advancing the state of the art, for instance, on event extraction.

But I can’t resist pointing out that even the crude vector-space model I have played with here can give us some leverage on page-level surprise, and in doing so, complicate the story I’ve just told. One odd detail I’ve noticed is that the predictability of a narrative at book scale (measured as our ability to predict the direction of time between two widely separated passages) correlates with a kind of unpredictability as we move from one sentence, page, or thousand-word passage to the next.

For instance, one way to describe the stability of a sequence is to measure “autocorrelation.” If we shift a time series relative to itself, moving it back by one step, how much does the original series correlate with the lagged version?

Fig 6. These are wholly imaginary curves to illustrate an idea.

A process with a lot of inertia (e.g., change in temperature across a year) might still have the same basic shape if we shift it backward eight hours. The amount of sunlight in Seattle, on the other hand, fluctuates daily and will be largely out of phase with itself if we shift it backward eight hours; the correlation between those two curves will be pretty low, or even negative as above.

Since we’re representing each passage of a book as a vector of 300 numbers, this gives us 300 time series—300 curves—for each volume. It is difficult to say what each curve represents; the individual components of a word embedding don’t come with interpretable labels. But we can measure the narrative’s general degree of inertia by asking how strongly these curves are, collectively, autocorrelated. Crudely: I shift each time series back one step (1000 words) and measure the Pearson correlation coefficient between the lagged and unlagged version. Then I take the mean correlation for all 300 series.*
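In code, the measure is simple enough to sketch in a few lines; each book here is an array with one row per 1000-word passage and one column per embedding dimension.

```python
# Mean lag-1 autocorrelation across the 300 embedding dimensions of a book.
import numpy as np

def mean_lag1_autocorrelation(book):
    """book: array of shape (n_passages, 300), passages in narrative order."""
    earlier, later = book[:-1], book[1:]
    correlations = []
    for k in range(book.shape[1]):
        r = np.corrcoef(earlier[:, k], later[:, k])[0, 1]
        if not np.isnan(r):            # a constant column would produce NaN; skip it
            correlations.append(r)
    return float(np.mean(correlations))

# e.g., a 90,000-word novel divided into ninety 1000-word passages:
toy_book = np.random.default_rng(1).normal(size=(90, 300))
print(mean_lag1_autocorrelation(toy_book))   # close to zero for pure noise
```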

Fig 7. Relationship between the volatility of the text (low autocorrelation) and accuracy of models that attempt to put two passages in the right order. Although there are more fiction volumes, we keep accuracy comparable by training on only 32 volumes at a time.

The result is unintuitive. You might think it would be easier to predict the direction of narrative time in books where variables change slowly—as temperature does—tracing a reliable arc. But instead it turns out that prediction is more accurate in books where these curves behave a bit like sunlight, fluctuating substantially every 1000 words. (The linear relationship with autocorrelation is r = -.237 in fig 7, though I suspect the real relationship isn’t linear.) Also, biography appears to be distinguished from fiction by higher autocorrelation (lower volatility).

So yes, fiction is more predictable than nonfiction across the sweep of a whole narrative (because the beginnings and ends of novels are rhetorically very distinct). But the same observation doesn’t necessarily hold as we move from page to page, or sentence to sentence. At that scale, fiction may be more volatile than nonfiction is. I don’t yet know why! We could speculate that this has something to do with an imperative to surprise the reader—but it might also be as simple as the alternation of dialogue and description, which creates a lot of rapid change in the verbal texture of fiction. In short, I’m pointing to a question rather than answering one. There appear to be several different kinds of “predictability” in narrative, and teasing them apart might give us some simple insights into the structural differences between fiction and nonfiction.

  • Postscript: Everything above is speculative and exploratory. I’ve shared some code and data in a repository, but I wouldn’t call it fully replicable. There are more sophisticated ways to measure autocorrelation. If any economists read this, it will occur to them that we could also “predict the future course of a story” using full vector autoregression or an ARIMA model. I’ve tried that, but my sense is that the results were actually dominated by the two factors explored separately above (before-and-after predictability and the autocorrelation of individual variables with themselves). Also, to make any of this really illuminate literary history, we will need a bigger and better corpus, allowing us to ask how patterns like this intersect with genre, prestige, and historical change. A group of researchers at Illinois, including Wenyi Shang and Peizhen Wu, are currently pursuing those questions.

References:

Edwin A. Abbott, Flatland: A Romance of Many Dimensions (London: 1884).

Sanjeev Arora, Yingyu Liang, Tengyu Ma, “A Simple but Tough-to-Beat Baseline for Sentence Embeddings.” ICLR 2017.

Jane Austen, Pride and Prejudice (London: Egerton, 1813).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

Caroline Levine, The Serious Pleasures of Suspense (Charlottesville, University of Virginia Press, 2003).

David McClure, “A Hierarchical Cluster of Words Across Narrative Time,” 2017.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, James Allen. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. NAACL 2016.

Shay Palachy, “Document Embedding Techniques: A Review of Notable Literature on the Topic,” Towards Data Science, September 9, 2019.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. EMNLP 2014.

Andrew Piper, Enumerations (Chicago: University of Chicago Press, 2018).

Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” EMNLP-IJCNLP 2019.

Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, James Pennebaker. Recollection Versus Imagination: Exploring Human Memory and Cognition via Neural Language Models. ACL 2020.

Vera Tobin, Elements of Surprise: Our Mental Limits and the Satisfactions of Plot (Cambridge: Harvard University Press, 2018).

Humanists own the fourth dimension, and we should take pride in it.

This is going to be a short, sweet, slightly-basic blog post, because I just have a simple thing to say.

I was originally trained as a scholar of eighteenth- and nineteenth-century British literature. As I learn more about other disciplines, I have been pleased to find that they are just as self-conscious and theoretically reflective as the one where I was trained. Every discipline has its own kind of theory.

But there is one thing that I still believe the humanities do better than any other part of the university: reflecting on historical change and on the historical mutability of the human mind. Lately, social scientists (e.g. economic historians or physical anthropologists) have sometimes given us a run for our money. But humanists are more accustomed to the paradoxes that emerge when the rules of the game you’re playing can get historicized and contextualized, and change right under your feet. (We even have a word for it: “hermeneutics.”) So I think we still basically own the dimension of time.

At the moment, we aren’t celebrating that fact very much. Perhaps we’re still reeling from the late-20th-century discovery that the humanities’ connection to the past can be described as “cultural capital.” Ownership of the collective past is something people fight over, and the humanities had a central position in 19th- and 20th-century education partly because they had the social function of distributing that kind of authority.

Wariness about that social function is legitimate and necessary. However, I don’t think it can negate the basic fact that human beings are shaped and guided by a culture we inherit. In a very literal sense we can’t understand ourselves without understanding the past.


I don’t think we can afford to play down this link to the past. At a moment when the humanities feel threatened by technological change, it may be tempting to get rid of anything that looks dusty. Out with seventeenth-century books, in with social media and sublimely complex network diagrams. Instead of identifying with the human past, we increasingly justify our fields of study by talking about “humanistic values.” The argument implicit (and sometimes explicit) in that gesture is that the humanities are distinguished from other disciplines not by what we study, but by studying it in a more critical or ethical way.

Maybe that will work. Maybe the world will decide that it needs us because we are the only people preserving ethical reflection in an otherwise fallen age of technology. But I don’t know. There isn’t a lot of evidence that humanists are actually, on average, more ethical than other people. And even if there were good evidence, “I am more critical and ethical than ye” is the kind of claim that often proves a hard sell.

But that’s a big question, and the jury is out. And anyway the humanities don’t need more negativity at the moment. I mainly want to underline a positive point, which is that historical change is a big deal for hominids. Its importance isn’t declining. We are self-programmed creatures, and it is a basic matter of self-respect to try to understand how and when we got our instructions. Humanists are people who try to understand that story, and we should take pride in our connection to time.


Do humanists need BERT?

This blog began as a space where I could tinker with unfamiliar methods. Lately I’ve had less time to do that, because I was finishing a book. But the book is out now—so, back to tinkering!

There are plenty of new methods to explore, because computational linguistics is advancing at a dizzying pace. In this post, I’m going to ask how historical inquiry might be advanced by Transformer-based models of language (like GPT and BERT). These models are handily beating previous benchmarks for natural language understanding. Will they also change historical conclusions based on text analysis? For instance, could BERT help us add information about word order to quantitative models of literary history that previously relied on word frequency? It is a slightly daunting question, because the new methods are not exactly easy to use.

I don’t claim to fully understand the Transformer architecture, although I get a feeling of understanding when I read this plain-spoken post by “nostalgebraist.” In essence Transformers capture information implicit in word order by allowing every word in a sentence—or in a paragraph—to have a relationship to every other word. For a fuller explanation, see the memorably-titled paper “Attention Is All You Need” (Vaswani et al. 2017). BERT is pre-trained on a massive English-language corpus; it learns by trying to predict missing words and put sentences in the right order (Devlin et al., 2018). This gives the model a generalized familiarity with the syntax and semantics of English. Users can then fine-tune the generic model for specific tasks, like answering questions or classifying documents in a particular domain.


Credit for meme goes to @Rachellescary.

Even if you have no intention of ever using the model, there is something thrilling about BERT’s ability to reuse the knowledge it gained solving one problem to get a head start on lots of other problems. This approach, called “transfer learning,” brings machine learning closer to learning of the human kind. (We don’t, after all, retrain ourselves from infancy every time we learn a new skill.) But there are also downsides to this sophistication. Frankly, BERT is still a pain for non-specialists to use. To fine-tune the model in a reasonable length of time, you need a GPU, and Macs don’t come with the commonly-supported GPUs. Neural models are also hard to interpret. So there is definitely a danger that BERT will seem arcane to humanists. As I said on Twitter, learning to use it is a bit like “memorizing incantations from a leather-bound tome.”

I’m not above the occasional incantation, but I would like to use BERT only where necessary. Communicating to a wide humanistic audience is more important to me than improving a model by 1%. On the other hand, if there are questions where BERT improves our results enough to produce basically new insights, I think I may want a copy of that tome! This post applies BERT to a couple of different problems, in order to sketch a boundary between situations where neural language understanding really helps, and those where it adds little value.

I won’t walk the reader through the whole process of installing and using BERT, because there are other posts that do it better, and because the details of my own workflow are explained in the github repo. But basically, here’s what you need:

1) A computer with a GPU that supports CUDA (NVIDIA’s platform for talking to the GPU). I don’t have one, so I’m running all of this on the Illinois Campus Cluster, using machines equipped with a Tesla K40M or K80 (I needed the latter to go up to 512-word segments).

2) The PyTorch module of Python, which includes classes that implement BERT, and translate it into CUDA instructions.

3) The BERT model itself (which is downloaded automatically by PyTorch when you need it). I used the base uncased model, because I wanted to start small; there are larger versions.

4) A few short Python scripts that divide your data into BERT-sized chunks (128 to 512 words) and then ask PyTorch to train and evaluate models. The scripts I’m using come ultimately from HuggingFace; I borrowed them via Thilina Rajapakse, because his simpler versions appeared less intimidating than the original code. But I have to admit: in getting these scripts to do everything I wanted to try, I sometimes had to consult the original HuggingFace code and add back the complexity Rajapakse had taken out.

Overall, this wasn’t terribly painful: getting BERT to work took a couple of days. Dependencies were, of course, the tricky part: you need a version of PyTorch that talks to your version of CUDA. For more details on my workflow (and the code I’m using), you can consult the github repo.
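If you would rather not wrestle with those older scripts, the same experiment can be sketched more compactly with the current transformers and datasets packages. The block below is a hedged sketch of that newer route, not the code I actually ran; the hyperparameters are placeholders, and it assumes a CUDA-capable GPU plus the IMDb dataset from the Hugging Face hub.

```python
# Fine-tune bert-base-uncased for binary sentiment classification on IMDb.
# Sketch only: this truncates each review to a single 128-token segment,
# rather than chunking and "voting" over segments as described below.
import numpy as np
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

imdb = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

imdb = imdb.map(tokenize, batched=True)

def accuracy(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

args = TrainingArguments(output_dir="bert_imdb",
                         num_train_epochs=1,              # one epoch, matching the settings described below
                         per_device_train_batch_size=24)  # batch size is a placeholder

trainer = Trainer(model=model, args=args, compute_metrics=accuracy,
                  train_dataset=imdb["train"], eval_dataset=imdb["test"])
trainer.train()
print(trainer.evaluate())
```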

So, how useful is BERT? To start with, let’s consider how it performs on a standard sentiment-analysis task: distinguishing positive and negative opinions in 25,000 movie reviews from IMDb. It takes about thirty minutes to convert the data into BERT format, another thirty to fine-tune BERT on the training data, and a final thirty to evaluate the model on a validation set. The results blow previous benchmarks away. I wrote a casual baseline using logistic regression to make predictions about bags of words; BERT easily outperforms both my model and the more sophisticated model that was offered as state-of-the-art in 2011 by the researchers who developed the IMDb dataset (Maas et al. 2011).


Accuracy on the IMDb dataset from Maas et al.; classes are always balanced; the “best BoW” figure is taken from Maas et al.
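For comparison, the “casual baseline” mentioned above takes only a few lines of scikit-learn. This is a sketch rather than my exact baseline: the feature weighting and regularization may differ, and train_texts, train_labels, test_texts, and test_labels are placeholder names for lists of review strings and 0/1 sentiment labels.

```python
# Bag-of-words baseline: weighted word counts plus regularized logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

bow_model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, min_df=5),  # one reasonable weighting, not necessarily mine
    LogisticRegression(C=1.0, max_iter=1000),
)

# bow_model.fit(train_texts, train_labels)
# print("whole-review accuracy:", bow_model.score(test_texts, test_labels))
```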

I suspect it is possible to get even better performance from BERT. This was a first pass with very basic settings: I used the bert-base-uncased model, divided reviews into segments of 128 words each, ran batches of 24 segments at a time, and ran only a single “epoch” of training. All of those choices could be refined.

Note that even with these relatively short texts (the movie reviews average 234 words long), there is a big difference between accuracy on a single 128-word chunk and on the whole review. Longer texts provide more information, and support more accurate modeling. The bag-of-words model can automatically take full advantage of length, treating the whole review as a single, richly specified entity. BERT is limited to a fixed window; when texts are longer than the window, it has to compensate by aggregating predictions about separate chunks (“voting” or averaging them). When I force my bag-of-words model to do the same thing, it loses some accuracy—so we can infer that BERT is also handicapped by the narrowness of its window.

But for sentiment analysis, BERT’s strengths outweigh this handicap. When a review says that a movie is “less interesting than The Favourite,” a bag-of-words model will see “interesting!” and “favorite!” BERT, on the other hand, is capable of registering the negation.

Okay, but this is a task well suited to BERT: modeling a boundary where syntax makes a big difference, in relatively short texts. How does BERT perform on problems more typical of recent work in cultural analytics—say, questions about genre in volume-sized documents?

The answer is that it struggles. It can sometimes equal, but rarely surpass, logistic regression on bags of words. Since I thought BERT would at least equal a bag-of-words model, I was puzzled by this result, and didn’t believe it until I saw the same code working very well on the sentiment-analysis task above.


The accuracy of models predicting genre. Boxplots reflect logistic regression on bags of words; we run 30 train/test/validation splits and plot the variation. For BERT, I ran a half-dozen models for each genre and plotted the best result. Small b is accuracy on individual chunks; capital B after aggregating predictions at volume level. All models use 250 volumes evenly drawn from positive and negative classes. BERT settings are usually 512 words / 2 epochs, except for the detective genre, which seemed to perform better at 256/1. More tuning might help there.

Why can’t BERT beat older methods of genre classification? I am not entirely sure yet. I don’t think BERT is simply bad at fiction, because it’s trained on Google Books, and Sims et al. get excellent results using BERT embeddings on fiction at paragraph scale. What I suspect is that models of genre require a different kind of representation—one that emphasizes subtle differences of proportion rather than questions of word sequence, and one that can be scaled up. BERT did much better on all genres when I shifted from 128-word segments to 256- and then 512-word lengths. Conversely, bag-of-words methods also suffer significantly when they’re forced to model genre in a short window: they lose more accuracy than they lost modeling movie reviews, even after aggregating multiple “votes” for each volume.

It seems that genre is expressed more diffusely than the opinions of a movie reviewer. If we chose a single paragraph randomly from a work of fiction, it wouldn’t necessarily be easy for human eyes to categorize it by genre. It is a lovely day in Hertfordshire, and Lady Cholmondeley has invited six guests to dinner. Is this a detective story or a novel of manners? It may remain hard to say for the first twenty pages. It gets easier after her nephew gags, turns purple and goes face-first into the soup course, but even then, we may get pages of apparent small talk in the middle of the book that could have come from a different genre. (Interestingly, BERT performed best on science fiction. This is speculative, but I tend to suspect it’s because the weirdness of SF is more legible locally, at the page level, than is the case for other genres.)

Although it may be legible locally in SF, genre is usually a question about a gestalt, and BERT isn’t designed to trace boundaries between 100,000-word gestalts. Our bag-of-words model may seem primitive, but it actually excels at tracing those boundaries. At the level of a whole book, subtle differences in the relative proportions of words can distinguish detective stories from realist novels with sordid criminal incidents, or from science fiction with noir elements.

I am dwelling on this point because the recent buzz around neural networks has revivified an old prejudice against bag-of-words methods. Dissolving sentences to count words individually doesn’t sound like the way human beings read. So when people are first introduced to this approach, their intuitive response is always to improve it by adding longer phrases, information about sentence structure, and so on. I initially thought that would help; computer scientists initially thought so; everyone does, initially. Researchers have spent the past thirty years trying to improve bags of words by throwing additional features into the bag (Bekkerman and Allan 2003). But these efforts rarely move the needle a great deal, and perhaps now we see why not.

BERT is very good at learning from word order—good enough to make a big difference for questions where word order actually matters. If BERT isn’t much help for classifying long documents, it may be time to conclude that word order just doesn’t cast much light on questions about theme and genre. Maybe genres take shape at a level of generality where it doesn’t really matter whether “Baroness poisoned nephew” or “nephew poisoned Baroness.”

I say “maybe” because this is just a blog post based on one week of tinkering. I tried varying the segment length, batch size, and number of epochs, but I haven’t yet tried the “large” or “cased” pre-trained models. It is also likely that BERT could improve if given further pre-training on fiction. Finally, to really figure out how much BERT can add to existing models of genre, we might try combining it in an ensemble with older methods. If you asked me to bet, though, I would bet that none of those stratagems will dramatically change the outlines of the picture sketched above. We have at this point a lot of evidence that genre classification is a basically different problem from paragraph-level NLP.

Anyway, to return to the question in the title of the post: based on what I have seen so far, I don’t expect Transformer models to displace other forms of text analysis. Transformers are clearly going to be important. They already excel at a wide range of paragraph-level tasks: answering questions about a short passage, recognizing logical relations between sentences, predicting which sentence comes next. Those strengths will matter for classification boundaries where syntax matters (like sentiment). More importantly, they could open up entirely new avenues of research: Sims et al. have been using BERT embeddings for event detection, for instance—implying a new angle of attack on plot.

But volume-scale questions about theme and genre appear to represent a different sort of modeling challenge. I don’t see much evidence that BERT will help there; simpler methods are actually tailored to the nature of this task with a precision we ought to appreciate.

Finally, if you’re on the fence about exploring this topic, it might be shrewd to wait a year or two. I don’t believe Transformer models have to be hard to use; they are hard right now, I suspect, mostly because the technology isn’t mature yet. So you may run into funky issues about dependencies, GPU compatibility, and so on. I would expect some of those kinks to get worked out over time; maybe eventually this will become as easy as “from sklearn import bert”?

References

Bekkerman, Ron, and James Allan. “Using Bigrams in Text Categorization.” 2003. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.152.4885&rep=rep1&type=pdf

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. 2018. https://arxiv.org/pdf/1810.04805.pdf

HuggingFace. “PyTorch Pretrained BERT: The Big and Extending Repository of Pretrained Transformers.” https://github.com/huggingface/pytorch-pretrained-BERT

Maas, Andrew, et al. “Learning Word Vectors for Sentiment Analysis.” 2011. https://www.aclweb.org/anthology/P11-1015

Rajapakse, Thilina. “A Simple Guide to Using BERT for Binary Text Classification.” 2019. https://medium.com/swlh/a-simple-guide-on-using-bert-for-text-classification-bbf041ac8d04

Sims, Matthew, Jong Ho Park, and David Bamman. “Literary Event Detection.” 2019. http://people.ischool.berkeley.edu/~dbamman/pubs/pdf/acl2019_literary_events.pdf

Underwood, Ted. “The Life Cycles of Genres.” The Journal of Cultural Analytics. 2016. https://culturalanalytics.org/2016/05/the-life-cycles-of-genres/

Vaswani, Ashish, et al. “Attention Is All You Need.” 2017. https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Do topic models warp time?

Recently, historians have been trying to understand cultural change by measuring the “distances” that separate texts, songs, or other cultural artifacts. Where distances are large, they infer that change has been rapid. There are many ways to define distance, but one common strategy begins by topic modeling the evidence. Each novel (or song, or political speech) can be represented as a distribution across topics in the model. Then researchers estimate the pace of change by measuring distances between topic distributions.

In 2015, Mauch et al. used this strategy to measure the pace of change in popular music—arguing, for instance, that changes linked to hip-hop were more dramatic than the British invasion. Last year, Barron et al. used a similar strategy to measure the influence of speakers in French Revolutionary debate.

I don’t think topic modeling causes problems in either of the papers I just mentioned. But these methods are so useful that they’re likely to be widely imitated, and I do want to warn interested people about a couple of pitfalls I’ve encountered along the road.

One reason for skepticism will immediately occur to humanists: are human perceptions about difference even roughly proportional to the “distances” between topic distributions? In one case study I examined, the answer turned out to be “yes,” but there are caveats attached. Read the paper if you’re curious.

In this blog post, I’ll explore a simpler and weirder problem. Unless we’re careful about the way we measure “distance,” topic models can warp time. Time may seem to pass more slowly toward the edges of a long topic model, and more rapidly toward its center.

For instance, suppose we want to understand the pace of change in fiction between 1885 and 1984. To make sure that there is exactly the same amount of evidence in each decade, we might randomly select 750 works in each decade, and reduce each work to 10,000 randomly sampled words. We topic-model this corpus. Now, suppose we measure change across every year in the timeline by calculating the average cosine distance between the two previous years and the next two years. So, for instance, we measure change across the year 1911 by taking each work published in 1909 or 1910, and comparing its topic proportions (individually) to every work published in 1912 or 1913. Then we’ll calculate the average of all those distances. The (real) results of this experiment are shown below.
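Schematically, that naive measure looks like the function below. This is a sketch, not the actual experiment code; doc_topics and doc_years are placeholder names for an array of topic proportions (one row per work) and an array of publication years.

```python
# Change across a year = mean cosine distance between every work in the two
# preceding years and every work in the two following years.
import numpy as np
from scipy.spatial.distance import cosine

def change_across_year(year, doc_topics, doc_years, window=2):
    before = doc_topics[(doc_years >= year - window) & (doc_years < year)]
    after = doc_topics[(doc_years > year) & (doc_years <= year + window)]
    return np.mean([cosine(a, b) for a in before for b in after])
```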

[Figure: apparent pace of change in fiction, 1885–1984, measured by averaging cosine distances between individual topic distributions.]

Perhaps we’re excited to discover that the pace of change in fiction peaks around 1930, and declines later in the twentieth century. It fits a theory we have about modernism! Wanting to discover whether the decline continues all the way to the present, we add 25 years more evidence, and create a new topic model covering the century from 1910 to 2009. Then we measure change, once again, by measuring distances between topic distributions. Now we can plot the pace of change measured in two different models. Where they overlap, the two models are covering exactly the same works of fiction. The only difference is that one covers a century (1885-1984) centered at 1935, and the other a century (1910-2009) centered at 1960.

[Figure: pace of change measured in two overlapping models, one spanning 1885–1984 and one spanning 1910–2009.]

But the two models provide significantly different pictures of the period where they overlap. 1978, which was a period of relatively slow change in the first model, is now a peak of rapid change. On the other hand, 1920, which was a point of relatively rapid change, is now a trough of sluggishness.

Puzzled by this sort of evidence, I discussed this problem with Laure Thompson and David Mimno at Cornell, who suggested that I should run a whole series of models using a moving window on the same underlying evidence. So I slid a 100-year window across the two centuries from 1810 to 2009 in five 25-year steps. The results are shown below; I’ve smoothed the curves a little to make the pattern easier to perceive.

[Figure: pace of change in five models produced by sliding a 100-year window across 1810–2009.]

The models don’t agree with each other well at all. You may also notice that all these curves are loosely n-shaped; they peak at the middle and decline toward the edges (although sometimes to an uneven extent). That’s why 1920 showed rapid change in a model centered at 1935, but became a trough of sloth in one centered at 1960. To make the pattern clearer we can directly superimpose all five models and plot them on an x-axis using date relative to the model’s timeline (instead of absolute date).

[Figure: the five models superimposed, with the x-axis showing date relative to each model’s timeline.]

The pattern is clear: if you measure the pace of change by comparing documents individually, time is going to seem to move faster near the center of the model. I don’t entirely understand why this happens, but I suspect the problem is that topic diversity tends to be higher toward the center of a long timeline. When the modeling process is dividing topics, phenomena at the edges of the timeline may fall just below the threshold to form a distinct topic, because they’re more sparsely represented in the corpus (just by virtue of being near an edge). So phenomena at the center will tend to be described with finer resolution, and distances between pairs of documents will tend to be greater there. (In our conversation about the problem, David Mimno ran a generative simulation that produced loosely similar behavior.)

To confirm that this is the problem, I’ve also measured the average cosine distance, and Kullback-Leibler divergence, between pairs of documents in the same year. You get the same n-shaped pattern seen above. In other words, the problem has nothing to do with rates of change as such; it’s just that all distances tend to be larger toward the center of a topic model than at its edges. The pattern is less clearly n-shaped with KL divergence than with cosine distance, but I’ve seen some evidence that it distorts KL divergence as well.

But don’t panic. First, I doubt this is a problem with topic models that cover less than a decade or two. On a sufficiently short timeline, there may be no systematic difference between topics represented at the center and at the edges. Also, this pitfall is easy to avoid if we’re cautious about the way we measure distance. For instance, in the example above I measured cosine distance between individual pairs of documents across a 5-year period, and then averaged all the distances to create an “average pace of change.” Mathematically, that way of averaging things is slightly sketchy, for reasons Xanda Schofield explained on Twitter:

[Screenshot of a tweet by Xanda Schofield on the pitfalls of averaging cosine distances.]

The mathematics of cosine distance tend to work better if you average the documents first, and then measure the cosine between the averages (or “centroids”). If you take that approach—producing yearly centroids and comparing the centroids—the five overlapping models actually agree with each other very well.
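In the same schematic terms, the fix replaces the average of many pairwise distances with a single distance between averages. (Again a sketch; my actual code builds yearly centroids, and the binning details may differ slightly.)

```python
# Average the topic vectors on each side of the year first ("centroids"),
# then take one cosine distance between the two centroids.
import numpy as np
from scipy.spatial.distance import cosine

def centroid_change_across_year(year, doc_topics, doc_years, window=2):
    before = doc_topics[(doc_years >= year - window) & (doc_years < year)].mean(axis=0)
    after = doc_topics[(doc_years > year) & (doc_years <= year + window)].mean(axis=0)
    return cosine(before, after)
```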

[Figure: pace of change measured between yearly centroids; the five overlapping models now agree closely.]

Calculating centroids factors out the n-shaped pattern governing average distances between individual books, and focuses on the (smaller) component of distance that is actually year-to-year change. Lines produced this way agree very closely, even about individual years where change seems to accelerate. As substantive literary history, I would take this evidence with a grain of salt: the corpus I’m using is small enough that the apparent peaks could well be produced by accidents of sampling. But the math itself is working.

I’m slightly more confident about the overall decline in the pace of change from the nineteenth century to the twenty-first. Although it doesn’t look huge on this graph, that pattern is statistically quite strong. But I would want to look harder before venturing a literary interpretation. For instance, is this pattern specific to fiction, or does it reflect a broadly shared deceleration in underlying rates of linguistic change? As I argued in a recent paper, supervised models may be better than raw distance measures at answering that culturally-specific question.

But I’m wandering from the topic of this post. The key observation I wanted to share is just that topic models produce a kind of curved space when applied to long timelines; if you’re measuring distances between individual topic distributions, it may not be safe to assume that your yardstick means the same thing at every point in time. This is not a reason for despair: there are lots of good ways to address the distortion. But it’s the kind of thing researchers will want to be aware of.

 

Remarks for a panel on data science in literary studies, in 2028

A transcript of these remarks was sent back via time machine to the Novel Theory conference at Ithaca in 2018, where a panel had been asked to envision what literary studies would look like if data analysis came to be understood as a normal part of the discipline.

I want to congratulate the organizers on their timeliness; 2028 is the perfect time for this retrospective. Even ten years ago I think few of us could have imagined that quantitative methods would become as uncontroversial in literary studies as they are today. (I, myself, might not have believed it.)

But the emergence of a data science option in the undergrad major changed a lot. It is true that the option only exists at a few schools, and most undergrads don’t opt for it. But it has made a few students confident enough to explore further. Today, almost 10% of dissertations in literary studies use numbers in some way.

A tenth of the field is not huge, but even that small foothold has had catalytic effects. One obvious effect has been to give literary studies a warm-water port in the social sciences. Without data science, I’m not sure we would be having the vigorous conversations we’re having today with sociologists, legal scholars, and historians of social media. Increasingly, our fields are bound together by a shared concern with large-scale textual interpretation. That shared conversation, in turn, invites journalists to give literary arguments a different kind of attention. There are examples everywhere. But since we’re all reading it, let me just point to today’s article in The Guardian on the Trump Prison Diaries—which couldn’t have been written ten years ago, for several obvious reasons.

But data analysis has also led to less obvious changes. I think even the recent, widely-discussed return to evaluative criticism—the so-called “new belletrism”—may have had something to do with data.


The conference venue, easily reached by water taxi.

I know this will seem unlikely. These are usually presented as opposing currents in the contemporary scene—one artsy, one mathy; one tending to focus on contemporary literature, the other on the longue durée. But I would argue that quantitative methods have made it easier to treat aesthetic arguments as scholarly questions.

It used to be difficult, after all, to reconcile evaluation with historicism. If you disavowed timeless aesthetic judgments, then it seemed you could only do historical reportage on the peculiar opinions of the 1820s or the 1920s.

Work on the history of reception has created an expansive middle ground between those poles—a way to study judgments that do change, but change in practice very slowly, across century-spanning arcs. Those arcs became visible when we backed up to a distance, but in a sense we can’t get historical distance from them; they sprawl into the present and can’t be relegated to the past. So historical scholars and contemporary critics are increasingly forced onto each other’s turf. Witness the fireworks lately between MFA programs and distant readers arguing that the recent history of the novel is really driven by genre fiction.

Most of these fireworks, I think, are healthy. But there have also been downsides. Ten years ago, none of us imagined that divisions within the data science community could become as deep or bitter as they are today. In a general sense data may be uncontroversial: many literary scholars use, say, a table of sales figures. But machine learning is more controversial than ever.

Ironically, it is often the people most enthusiastic about other forms of data who are rejecting machine learning. And I have to admit they have a point when they call it “subjective.”

It is notoriously true, after all, that learning algorithms absorb the biases implicit in the evidence you give them. So in choosing evidence for our models we are, in a sense, choosing a historical vantage point. Some people see this as appropriate for an interpretive discipline. “We have always known that interpretation was circular,” they say, “and it’s healthy to acknowledge that our inferences start from a situated perspective” (Rosencrantz, 2025). Other people worry that the subjectivity of machine learning is troubling, because it “hides inside a black box” (Guildenstern, 2027). I don’t think the debate is going away soon; I fear literary theorists will still be arguing about it when we meet again in 2038.

New methods need a new kind of conversation

Over the last decade, the (small) fraction of articles in the humanities that use numbers has slowly grown. This is happening partly because computational methods are becoming flexible enough to represent a wider range of humanistic evidence. We can model concepts and social practices, for instance, instead of just counting people and things.

That’s exciting, but flexibility also makes arguments complex and hard to review. Journal editors in the humanities may not have a long list of reviewers who can evaluate statistical models. So while quantitative articles certainly encounter some resistance, they don’t always get the kind of detailed resistance they need. I thought it might be useful to stir up conversation on this topic with a few suggestions, aimed less at the DH community than at the broader community of editors and reviewers in the humanities. I’ll start with proposals where I think there’s consensus, and get more opinionated as I go along.

1. Ask to see code and data.

Getting an informed reviewer is a great first step. But to be honest, there’s not a lot of consensus yet about many methodological questions in the humanities. What we need is not so much strict gatekeeping as transparent debate.

As computational methods spread in the sciences, scientists have realized that it’s impossible to discuss this work fruitfully if you can’t see how the work was done.  Journals like Cultural Analytics reflect this emerging consensus with policies that require authors to share code and data. But mainstream humanities journals don’t usually have a policy in place yet.

Three or four years ago, confusion on this topic was understandable. But in 2018, journals that accept quantitative evidence at all need a policy that requires authors to share code and data when they submit an article for review, and to make it public when the article is published.

I don’t think the details of that policy matter deeply. There are lots of different ways to archive code and data; they are all okay. Special cases and quibbles can be accommodated. For instance, texts covered by copyright (or other forms of IP) need not be shared in their original form. Derived data can be shared instead; that’s usually fine. (Ideally one might also share the code used to derive it.)

2. … especially code.

Humanists are usually skeptical enough about the data underpinning an argument, because decades of debate about canons have trained us to pose questions about the works an author chooses to discuss.

But we haven’t been trained to pose questions about the magnitude of a pattern, or the degree of uncertainty surrounding it. These aspects of a mathematical argument often deserve more discussion than an author initially provides, and to discuss them, we’re going to need to see the code.

I don’t think we should expect code to be polished, or to run easily on any machine. Writing an article doesn’t commit the author to produce an elegant software tool. (In fact, to be blunt, “it’s okay for academic software to suck.”) The author just needs to document what they did, and the best way to do that is to share the code and data they actually used, warts and all.

3. Reproducibility is great, but replication is the real point.

Ideally, the code and data supporting an article should permit a reader to reproduce all the stages of analysis the author(s) originally performed. When this is true, we say the research is “reproducible.”

But there are often rough spots in reproducibility. Stochastic processes may not run exactly the same way each time, for instance.

At this point, people who study reproducibility professionally will crowd forward and offer an eleven-point plan for addressing all rough spots. (“You just set the random number seed so it’s predictable …”)

That’s wonderful, if we really want to polish a system that allows a reader to push a button and get the same result as the original researcher, to the seventh decimal place. But in the humanities, we’re not always at the “polishing” stage of inquiry yet. Often, our question is more like “could this conceivably work? and if so, would it matter?”

In short, I think we shouldn’t let the imperative to share code foster a premature perfectionism. Our ultimate goal is not to prove that you get exactly the same result as the author if you use exactly the same assumptions and the same books. It’s to decide whether the experiment is revealing anything meaningful about the human past. And to decide that, we probably want to repeat the author’s question using different assumptions and a different sample of books.

When we do that, we are not reproducing the argument but replicating it. (See Language Log for a fuller discussion of the difference.) Replication is the real prize in most cases; that’s how knowledge advances. So the point of sharing code and data is often less to stabilize the results of your own work to the seventh decimal place, and more to guide investigators who may want to undertake parallel inquiries. (For instance, Jonathan Goodwin borrowed some of my code to pose a parallel question about Darko Suvin’s model of science fiction.)

I admit this is personal opinion. But I stress replication over reproducibility because it has some implications for the spirit of the whole endeavor. Since people often imagine that quantitative problems have a right answer, we may initially imagine that the point of sharing code and data is simply to catch mistakes.

In my view the point is rather to permit a (mathematical) conversation about the interpretation of the human past. I hope authors and readers will understand themselves as delayed collaborators, working together to explore different options. What if we did X differently? What if we tried a different sample of books? Usually neither sample is wrong, and neither is right. The point is to understand how much different interpretive assumptions do or don’t change our conclusions. In a sense no single article can answer that question “correctly”; it’s a question that has to be solved collectively, by returning to questions and adjusting the way we frame them. The real point of code-sharing is to permit that kind of delayed collaboration.

 

A broader purpose

The weather prevents me from being there physically, but this is a transcript of my remarks for “Varieties of Digital Humanities,” MLA, Jan 5, 2018.

Using numbers to understand cultural history is often called “cultural analytics”—or sometimes, if we’re talking about literary history in particular, “distant reading.” The practice is older than either name: sociologists, linguists, and adventurous critics like Janice Radway have been using quantitative methods for a long time.

But over the last twenty years, numbers have begun to have a broader impact on literary study, because we’ve learned to use them in a wider range of ways. We no longer just count things that happen to be easily counted (individual words, for instance, or books sold). Instead scholars can start with literary questions that really interest readers, and find ways to model them. Recent projects have cast light, for instance, on the visual impact of poetry, on imagined geography in the novel, on the instability of gender, and on the global diffusion of stream of consciousness. Articles that use numbers are appearing in central disciplinary venues: MLQ, Critical Inquiry, PMLA. Equally important: a new journal called Cultural Analytics has set high standards for transparent and reproducible research.

Of course, scholars still disagree with each other. And that’s part of what makes this field exciting. We aren’t simply piling up facts. New methods are sparking debate about the nature of the knowledge literary historians aim to produce. Are we interpreting the past or explaining it? Can numbers address perspectival questions? The name for these debates is “critical theory.” Twenty years from now, I think it will be clear that questions about quantitative models form an important unit in undergraduate theory courses.

Literary scholars are used to imagining numbers as tools, not as theories. So there’s translation work to be done. But translating between theoretical traditions could be the most important part of this project. Our existing tradition of critical theory teaches students to ask indispensable questions—about power, for instance, and the material basis of ideology. But persuasive answers to those questions will often require a lot of evidence, and the art of extracting meaningful patterns from evidence is taught by a different theoretical tradition, called “statistics.” Students will be best prepared for the twenty-first century if they can connect the two traditions, and do critical theory with numbers.

So in a lot of ways, this is a heady moment. Cultural analytics has historical discoveries, lively theoretical debates, and a public educational purpose. Intellectually, we’re in good shape.

But institutionally, we’re in awful shape. Or to be blunt: we are shape-less. Most literature departments do not teach students how to do this stuff at all. Everything I’ve just discussed may be represented by one unit in one course, where students play with topic models. Reduced to that size, I’m not sure cultural analytics makes any sense. If we were seriously trying to teach students to do critical theory with numbers, we would need to create a sequence of courses that guides them through basic principles (of statistical inference as well as historical interpretation) toward projects where they can pose real questions about the past.

What keeps us from building that curriculum? Part of the obstacle, I think, is the term digital humanities itself. Don’t get me wrong: I’m grateful for the popularity of DH. It has lent energy to many different projects. But the term digital humanities has been popular precisely because it promises that all those projects can still be contained in the humanities. The implicit pitch is something like this: “You won’t need a whole statistics course. Come to our two-hour workshop on topic models instead. You can always find a statistician to collaborate with.”

I understand why digital humanists said that kind of thing eight years ago. We didn’t want to frighten people away. If you write “Learn Another Discipline” on your welcome mat, you may not get many visitors. But a deceptively gentle welcome mat, followed by a trapdoor, is not really more welcoming. So it’s time to be honest about the preparation needed for cultural analytics. Young people entering this field will need to understand the whole process. They won’t even be able to pose meaningful questions, for instance, without some statistics.

Trompe l’oeil faux door mural from http://www.bumblebeemurals.com/cool-wall-murals/

But the metaphor of a welcome mat may be too optimistic. This field doesn’t have a door yet. I mean, there is no curriculum. So of course the field tends to attract people who already have an extracurricular background—which, of course, is not equally distributed. It shouldn’t surprise us that access is a problem when this field only exists as a social network. The point of a classroom is to distribute knowledge in a more equal, less homosocial way. But digital humanities classes, as currently defined, don’t really teach students how to use numbers. (For a bracingly honest exploration of the problem, see Andrew Goldstone.) So it’s almost naive to discuss “barriers to entry.” There is no entrance to this field. What we have is more like a door painted on the wall. But we’re in denial about that—because to admit the problem, we would have to admit that “DH” isn’t working as a gateway to everything it claims to contain.

I think the courses that can really open doors to cultural analytics are found, right now, in the social sciences. That’s why I recently moved half of my teaching to a School of Information Sciences. There, you find a curricular path that covers statistics and programming along with social questions about technology. I don’t think it’s an accident that you also find better gender and ethnic diversity among people using numbers in the social sciences. Methods get distributed more equally within a discipline that actually teaches the methods. So I recommend fusing cultural analytics with social science partly because it immediately makes this field more diverse. I’m not offering that as a sufficient answer to problems of access. I welcome other answers too. But I am suggesting that social-scientific methods are a necessary part of access. We cannot lower barriers to entry by continuing to pretend that cultural analytics is just the humanities, plus some user-friendly digital tools. That amounts to a trompe-l’oeil door.

What the social sciences lack are courses in literary history. And that’s important, because distant readers set out to answer concrete historical questions. So the unfortunate reality is, this project cannot be contained in one discipline.  The questions we try to answer are taught in the humanities. But the methods we use are taught, right now, in the social sciences and data science. Even if it frightens some students off, we have to acknowledge that cultural analytics is a multi-disciplinary project—a bridge between the humanities and quantitative social science, belonging equally to both.

I’m not recommending this approach for the DH community as a whole. DH has succeeded by fitting into the institutional framework of the humanities. DH courses are often pitched to English or History majors, and for many topics, that works brilliantly. But it’s awkward for quantitative courses. To use numbers wisely, students need preparation that an English major doesn’t provide. So increasingly I see the quantitative parts of DH presented as an interdisciplinary program rather than a concentration in the humanities.

In saying this, I don’t mean to undersell the value of numbers for humanists. New methods can profoundly transform our view of the human past, and the research is deeply rewarding. So I’m convinced that statistics, and even machine learning, will gradually acquire a place in the humanistic curriculum.

I’m just saying that this is a bigger, slower project than the rhetoric of DH may have led us to expect. Mathematics doesn’t really come packaged in digital tools. Math is a way of thinking, and using it means entering into a long-term relationship with statisticians and social scientists. We are not borrowing tools for private use inside our discipline, but starting a theoretical conversation that should turn us outward, toward new forms of engagement with our colleagues and the world.

What is the point of studying culture with numbers, after all? It’s not to change English departments, but to enrich the way all students think about culture. The questions we’re posing can have real implications for the way students understand their roles in history—for instance, by linking their own cultural experience to century-spanning trends. Even more urgently, these questions give students a way to connect interpretive insights and resonant human details with habits of experimental inquiry.

Instead of imagining cultural analytics as a subfield of DH, I would almost call it an emerging way to integrate the different aspects of a liberal education. People who want to tackle that challenge are going to have to work across departments to some extent: it’s not a project that an English department could contain. But it is nevertheless an important opportunity for literary scholars, since it’s a place where our work becomes central to the broader purposes of the university as a whole.

It looks like you’re writing an argument against data in literary study …

would you like some help with that?

I’m not being snarky. Right now, I have several friends writing articles that are largely or partly a critique of interrelated trends that go under the names “data” or “distant reading.” It looks like many other articles of the same kind are being written.

This is good news! I believe fervently in Mae West’s theory of publicity. “I don’t care what the newspapers say about me as long as they spell my name right.” (Though it turns out we may not actually know who said that, so I guess the newspapers failed.)

In any case, this blog post is not going to try to stop you from proving that numbers are neoliberal, unethical, inevitably assert objectivity, aim to eliminate all close reading from literary study, fail to represent time, and lead to loss of “cultural authority.” Go for it! Ideas live on critique.

But I do want to help you “spell our names right.” Andrew Piper has recently pointed out that critiques of data-driven research tend to use a small sample of articles. He expressed that more strongly, but I happen to like the article he was aiming at, so I’m going to soften his expression. However, I don’t disagree with the underlying point! For some reason, critics of numbers don’t feel they need to consider more than one example, or two if they’re in a generous mood.

There are some admirable exceptions to this rule. I’ve argued that a recent issue of Genre was, in general, moving in the right direction. And I’m fairly confident that the trend will continue. A field that has been generating mostly articles and pamphlets is about to shift into a lower gear and publish several books. In literary studies, that tends to be an effective way of reframing debate.

But it may be another twelve to eighteen months before those books are out. In the meantime, you’ve got to finish your critique. So let me help with the bibliography.

When you’re tempted to assume that all possible uses of numbers (or “data”) in literary study can be summed up by engaging one or two texts that Franco Moretti wrote in the year 2000, you should resist the assumption. You are actually talking about a long, complex story, and your readers deserve some glimpse of its complexity.

For instance, sociologists, linguists and book historians have been using numbers to describe literature since the middle of the twentieth century. You should make clear whether you are critiquing that work, or just arguing that it is incapable of addressing the inner literariness of literature. The journal Computers and the Humanities started in the 1960s. The 1980s gave rise to a thriving tradition of feminist literary sociology, embodied in books by Janice Radway and Gaye Tuchman, and in the journal Signs. I’ve used one of Tuchman’s regression models as an illustration here.

Variables predicting literary fame in a regression model, from Gaye Tuchman and Nina E. Fortin, Edging Women Out (1989).

<deep breath>

In the 1990s, Mark Olsen (working at the University of Chicago) started to articulate many of the impulses we now call “distant reading.” Around 2000, Franco Moretti gave quantitative approaches an infusion of polemical verve and wit, which raised their profile among literary scholars who had not previously paid attention. (Also, frankly, the fact that Moretti already had disciplinary authority to spend mattered a great deal. Literary scholars can be temperamentally conservative even when theoretically radical.)

But Moretti himself is a moving target. The articles he has written since 2008 aim at different goals, and use different methods, than articles before that date. Part of the point of an experimental method, after all, is that you are forced to revise your assumptions! Because we are actually learning things, this field is changing rapidly. A recent pamphlet from the Stanford Literary Lab conceives the role of the “archive,” for instance, very differently than “Slaughterhouse of Literature” did.

But that pamphlet was written by six authors—a useful reminder that this is a collective project. Today the phrase “distant reading” is often a loose description for large-scale literary history, covering many people who disagree significantly with Moretti. In a recent roundtable in PMLA, for instance, Andrew Goldstone argues for evidence of a more sociological and less linguistic kind. Lisa Rhody and Alison Booth both argue for different scales or forms of “distance.” Richard Jean So argues that the simple measurements which typified much work before 2010 need to be replaced by statistical models, which account for variation and uncertainty in a more principled way.

One might also point, for instance, to Lauren Klein’s work on gaps in the archive, or to Ryan Cordell’s work on literary circulation, or to Katherine Bode’s work, which aims to construct corpora that represent literary circulation rather than production. Or to Matt Wilkens, or Hoyt Long, or Tanya Clement, or Matt Jockers, or James F. English … I’m going to run out of breath before I run out of examples.

Not all of these scholars believe that numbers will put literary scholarship on a more objective footing. Few of them believe that numbers can replace “interpretation” with “explanation.” None of them, as far as I can tell, have stopped doing close reading. (I would even claim to pair numbers with close reading in Joseph North’s strong sense of the phrase: not just reading-to-illustrate-a-point but reading-for-aesthetic-cultivation.) In short, the work literary scholars are doing with numbers is not easily unified by a shared set of principles—and definitely isn’t unified by a 17-year-old polemic. The field is unified, rather, by a fast-moving theoretical debate. Literary production versus circulation. Book history versus social science. Sociology versus linguistics. Measurement versus modeling. Interpretation versus explanation versus prediction.

Critics of this work may want to argue that it all nevertheless fails in the same way, because numbers inevitably (flatten time/reduce reading to visualization/exclude subjectivity/fill in the blank). That’s a fair thesis to pursue. But if you believe that, you need to show that your generalization is true by considering several different (recent!) examples, and teasing out the tacit similarities concealed underneath ostensible disagreements. I hope this post has helped with some of the bibliographic legwork. If you want more sources, I recently wrote a “Genealogy of Distant Reading” that will provide more. Now, tear them apart!

We’re probably due for another discussion of Stanley Fish

I think I see an interesting theoretical debate over the horizon. The debate is too big to resolve in a blog post, but I thought it might be narratively useful to foreshadow it—sort of as novelists create suspense by dropping hints about the character traits that will develop into conflict by the end of the book.

Basically, the problem is that scholars who use numbers to understand literary history have moved on from Stanley Fish’s critique, without much agreement about why or how. In the early 1970s, Fish gave a talk at the English Institute that defined a crucial problem for linguistic analysis of literature. Later published as “What Is Stylistics, and Why Are They Saying Such Terrible Things About It?”, the essay focused on “the absence of any constraint” governing the move “from description to interpretation.” Fish takes Louis Milic’s discussion of Jonathan Swift’s “habit of piling up words in series” as an example. Having demonstrated that Swift does this, Milic concludes that the habit “argues a fertile and well stocked mind.” But Fish asks how we can make that sort of inference, generally, about any linguistic pattern. How do we know that reliance on series demonstrates a “well stocked mind” rather than, say, “an anal-retentive personality”?

The problem is that isolating linguistic details for analysis also removes them from the context we normally use to give them a literary interpretation. We know what the exclamation “Sad!” implies, when we see it at the end of a Trumpian tweet. But if you tell me abstractly that writer A used “sad” more than writer B, I can’t necessarily tell you what it implies about either writer. If I try to find an answer by squinting at word lists, I’ll often make up something arbitrary. Word lists aren’t self-interpreting.

Thirty years passed; the internet got invented. In the excitement, dusty critiques from the 1970s got buried. But Fish’s argument was never actually killed, and if you listen to the squeaks of bats, you hear rumors that it still walks at night.

Or you could listen to blogs. This post is partly prompted by a blogged excerpt from a forthcoming work by Dennis Tenen, which quotes Fish to warn contemporary digital humanists that “a relation can always be found between any number of low-level, formal features of a text and a given high-level account of its meaning.” Without “explanatory frameworks,” we won’t know which of those relations are meaningful.

Ryan Cordell’s recent reflections on “machine objectivity” could lead us in a similar direction. At least they lead me in that direction, because I think the error Cordell discusses—over-reliance on machines themselves to ground analysis—often comes from a misguided attempt to solve the problem of arbitrariness exposed by Fish. Researchers are attracted to unsupervised methods like topic modeling in part because those methods seem to generate analytic categories that are entirely untainted by arbitrary human choices. But as Fish explained, you can’t escape making choices. (Should I label this topic “sadness” or “Presidential put-downs”?)

I don’t think any of these dilemmas are unresolvable. Although Fish’s critique identified a real problem, there are lots of valid solutions to it, and today I think most published research is solving the problem reasonably well. But how? Did something happen since the 1970s that made a difference? There are different opinions here, and the issues at stake are complex enough that it could take decades of conversation to work through them. Here I just want to sketch a few directions the conversation could go.

Dennis Tenen’s recent post implies that the underlying problem is that our models of form lack causal, explanatory force. “We must not mistake mere extrapolation for an account of deep causes and effects.” I don’t think he takes this conclusion quite to the point of arguing that predictive models should be avoided, but he definitely wants to recommend that mere prediction should be supplemented by explanatory inference. And to that extent, I agree—although, as I’ll say in a moment, I have a different diagnosis of the underlying problem.

It may also be worth reviewing Fish’s solution to his own dilemma in “What Is Stylistics,” which was that interpretive arguments need to be anchored in specific “interpretive acts” (93). That has always been a good idea. David Robinson’s analysis of Trump tweets identifies certain words (“badly,” “crazy”) as signs that a tweet was written by Trump, and others (“tomorrow,” “join”) as signs that it was written by his staff. But he also quotes whole tweets, so you can see how words are used in context, make your own interpretive judgment, and come to a better understanding of the model. There are many similar gestures in Stanford LitLab pamphlets: distant readers actually rely quite heavily on close reading.

My understanding of this problem has been shaped by a slightly later Fish essay, “Interpreting the Variorum” (1976), which returns to the problem broached in “What Is Stylistics,” but resolves it in a more social way. Fish concludes that interpretation is anchored not just in an individual reader’s acts of interpretation, but in “interpretive communities.” Here, I suspect, he is rediscovering an older hermeneutic insight, which is that human acts acquire meaning from the context of human history itself. So the interpretation of culture inevitably has a circular character.

One lesson I draw is simply that we shouldn’t work too hard to avoid making assumptions. Most of the time we do a decent job of connecting meaning to an implicit or explicit interpretive community. Pointing to examples, using word lists derived from a historical thesaurus or sentiment dictionary—all of that can work well enough. The really dubious moves we make often come from trying to escape circularity altogether, in order to achieve what Alan Liu has called “tabula rasa interpretation.”

But we can also make quantitative methods more explicit about their grounding in interpretive communities. Lauren Klein’s discussion of the TOME interface she constructed with Jacob Eisenstein is a good model here; Klein suggests that we can understand topic modeling better by dividing a corpus into subsets of documents (say, articles from different newspapers), to see how a topic varies across human contexts.
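
To make that concrete, here is a minimal sketch of the kind of grouping Klein describes, not the TOME interface itself: fit an ordinary topic model, then average each document’s topic proportions within human-defined groups (hypothetical newspaper labels below). Everything in the snippet, from the toy articles to the number of topics, is an illustrative assumption.

```python
# A sketch only: fit a topic model, then see how topic proportions vary
# across human contexts (here, hypothetical newspaper labels).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for a real corpus of articles with a known source.
docs = pd.DataFrame({
    'text': [
        'the mayor announced a new bridge and harbor improvements downtown',
        'cotton prices fell sharply on news from the southern markets',
        'the council debated funding for the bridge and the new schools',
        'wheat and cotton futures rallied after the harvest reports',
    ],
    'source': ['Herald', 'Tribune', 'Herald', 'Tribune'],
})

# Bag-of-words counts, then a small topic model.
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(docs['text'])
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)        # one row per document, summing to 1

# Average topic proportions within each newspaper, instead of treating
# the corpus as a single undifferentiated blob.
topic_cols = [f'topic_{i}' for i in range(doc_topics.shape[1])]
by_source = (pd.DataFrame(doc_topics, columns=topic_cols)
             .assign(source=docs['source'].values)
             .groupby('source')
             .mean())
print(by_source.round(3))
```

The interesting part is the last step: once topic proportions are tied back to sources, a topic’s meaning can be checked against the human contexts it comes from rather than taken at face value.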

Of course, if you pursue that approach systematically enough, it will lead you away from topic modeling toward methods that rely more explicitly on human judgment. I have been leaning on supervised algorithms a lot lately—not because they’re easier to test or more reliable than unsupervised ones—but because they explicitly acknowledge that interpretation has to be anchored in human history.

At first glance, this may seem to make progress impossible. “All we can ever discover is which books resemble these other books selected by a particular group of readers. The algorithm can only reproduce a category someone else already defined!” And yes, supervised modeling is circular. But this is a circularity shared by all interpretation of history, and it never merely reproduces its starting point. You can discover that books resemble each other to different degrees. You can discover that models defined by the responses of one interpretive community do or don’t align with models of another. And often you can, carefully, provisionally, draw explanatory inferences from the model itself, assisted perhaps by a bit of close reading.

I’m not trying to diss unsupervised methods here. Actually, unsupervised methods are based on clear, principled assumptions. And a topic model is already a lot more contextually grounded than “use of series == well stocked mind.” I’m just saying that the hermeneutic circle is a little slipperier in unsupervised learning, easier to misunderstand, and harder to defend to crowds of pitchfork-wielding skeptics.

In short, there are lots of good responses to Fish’s critique. But if that critique is going to be revived by skeptics over the next few years—as I suspect—I think I’ll take my stand for the moment on supervised machine learning, which can explicitly build bridges between details of literary language and social contexts of reception.  There are other ways to describe best practices: we could emphasize a need to seek “explanations,” or avoid claims of “objectivity.” But I think the crucial advance we have made over the 1970s is that we’re no longer just modeling language; we can model interpretive communities at the same time.

Photo credit: A school of yellow-tailed goatfish, photo for NOAA Photo Library, CC-BY Dwayne Meadows, 2004.

Postscript July 15: Jonathan Armoza points out that Stephen Ramsay wrote a post articulating his own, more deformative response to “What is Stylistics” in 2012.

Finding the great divide

Last year, Jordan Sellers and I published an article in Modern Language Quarterly, trying to trace the “great divide” that is supposed to open up between mass culture and advanced literary taste around the beginning of the twentieth century.

I’m borrowing the phrase “great divide” from Andreas Huyssen, but he’s not the only person to describe the phenomenon. Whether we explain it neutrally as a consequence of widespread literacy, or more skeptically as the rise of a “culture industry,” literary historians widely agree that popularity and prestige parted company in the twentieth century. So we were surprised not to be able to measure the widening gap.

We could certainly model literary taste. We trained a model to distinguish poets reviewed in elite literary magazines from a less celebrated “contrast group” selected randomly. The model achieved roughly 79% accuracy, 1820-1919, and the stability of the model itself raised interesting questions. But we didn’t find that the model’s accuracy increased across time in the way we would have expected in a period when elite and popular literary taste are specializing and growing apart.

Instead of concluding that the division never happened, we guessed that we had misunderstood it or looked in the wrong place. Algee-Hewitt and McGurl have pretty decisively confirmed that a divide exists in the twentieth century. So we ought to be able to see it emerging. Maybe we needed to reach further into the twentieth century — or maybe we would have better luck with fiction, since the history of fiction provides evidence about sales, as well as prestige?

In fact, getting evidence about that second, economic axis seems to be the key. It took work by many hands over a couple of years: Kyle Johnston, Sabrina Lee, and Jessica Mercado, as well as Jordan Sellers, have all contributed to this project. I’m presenting a preliminary account of our results at Cultural Analytics 2017, and this blog post is just a brief summary of the main point.

When you look at the books described as bestsellers by Publishers Weekly, or by book historians (see references to Altick, Bloom, Hackett, Leavis, below), it’s easy to see the two circles of the Venn diagram pulling apart: on the one hand bestsellers, on the other hand books reviewed in elite venues. (For our definition of “elite venues” see the “Table” in a supporting code & data repository.)

On the other hand, when you back up from bestsellers to look at a broader sample of literary production, it’s still not easy to detect increasing stylistic differentiation between the elite “reviewed” texts and the rest of the literary field. A classifier trained on the reviewed fiction has roughly 72.5% accuracy from 1850 to 1949; if you break the century into parts, there are some variations in accuracy, but no consistent pattern. (In a subsequent blog post, I’ll look at the fiddly details of algorithm choice and feature engineering, but the long and short of that question is — it doesn’t make a significant difference.)
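
For readers who want to see what an experiment like this looks like in code, here is a minimal sketch, not the project’s actual pipeline (that lives in the repository linked above): regularized logistic regression on word-frequency features, scored by cross-validation. The metadata file, its columns, and the tf-idf weighting are all illustrative assumptions; as the parenthesis above suggests, the specific feature engineering matters less than you might expect.

```python
# A minimal sketch of the classification experiment described above:
# can word frequencies predict whether a volume was reviewed in elite venues?
# Not the project's actual code; 'metadata.csv' and its columns are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

meta = pd.read_csv('metadata.csv')   # one row per volume: 'text_path', 'author', 'reviewed' (0/1)
texts = [open(path, encoding='utf-8').read() for path in meta['text_path']]

pipeline = make_pipeline(
    TfidfVectorizer(max_features=10000, sublinear_tf=True),   # normalized word features
    LogisticRegression(C=0.1, max_iter=1000),                 # heavy regularization for wide data
)

# Cross-validated accuracy: the ~72.5% figure above is this kind of number.
scores = cross_val_score(pipeline, texts, meta['reviewed'], cv=10)
print(f'mean accuracy: {scores.mean():.3f}')
```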

To understand why the growing separation of bestsellers from “reviewed” texts at the high end of the market doesn’t seem to make literary production as a whole more strongly stratified, I’ve tried mapping authors onto a two-dimensional model of the literary field, intended to echo Pierre Bourdieu’s well-known diagrams of the interaction between economic and cultural distinction.

Pierre Bourdieu, The Field of Cultural Production (1993), p. 49.

In the diagram below, for instance, the horizontal axis represents sales, and the vertical axis represents prestige. Sales would be easy to measure, if we had all the data. We actually don’t — so see the end of this post for the estimation strategy I adopted. Prestige, on the other hand, is difficult to measure: it’s perspectival and complex. So we modeled prestige by sampling texts that were reviewed in prominent literary magazines, and then training a model that used textual cues to predict the probability that any given book came from the “reviewed” set. An author’s prestige in this diagram is simply the average probability of review for their books. (The Stanford Literary Lab has similarly recreated Bourdieu’s model of distinction in their pamphlet “Canon/Archive,” using academic citations as a measure of prestige.)
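
Continuing the hypothetical pipeline sketched a few paragraphs back, the step from book-level predictions to the author-level vertical axis is small: give each volume an out-of-sample probability of review, then average within authors. Again, this is an illustration of the idea, not the code behind the diagrams, and the column names remain hypothetical.

```python
# Continuing the sketch above: each book gets an out-of-sample probability
# of having been reviewed, and an author's "prestige" is the mean of those
# probabilities across their books.
from sklearn.model_selection import cross_val_predict

# predict_proba on held-out folds, so no book is scored by a model that saw it in training.
probs = cross_val_predict(pipeline, texts, meta['reviewed'],
                          cv=10, method='predict_proba')[:, 1]

books = meta.assign(prob_reviewed=probs)
prestige = books.groupby('author')['prob_reviewed'].mean()   # vertical axis of the maps
print(prestige.sort_values(ascending=False).head())
```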

[Figure: the literary field, 1850-1874, with sales on the horizontal axis and prestige on the vertical axis.]

The upward drift of these points reveals a fairly strong correlation between prestige and sales. It is possible to find a few high-selling authors who are predicted to lack critical prestige — notably, for instance, the historical novelist W. H. Ainsworth and the sensation novelist Ellen Wood, author of East Lynne. It’s harder to find authors who have prestige but no sales: there’s not much in the northwest corner of the map. Arthur Helps, a Cambridge Apostle, is a fairly lonely figure.

Fast-forward seventy-five years and we see a different picture.

[Figure: the literary field, 1925-1949.]

The correlation between sales and prestige is now weaker; the cloud of authors is “rounder” overall.

There are also more authors in the “upper midwest” portion of the map now — people like Zora Neale Hurston and James Joyce, who have critical prestige but not enormous sales (or not up to 1949, at least as far as my model is aware).

There’s also a distinct “genre fiction” and “pulp fiction” world emerging in the southeast corner of this map, ranging from Agatha Christie to Mickey Spillane. (A few years earlier, Edgar Rice Burroughs and Zane Grey are in the same region.)

Moreover, if you just look at the large circles (the authors we’re most likely to remember), you can start to see how people in this period might get the idea that sales are actually negatively correlated with critical prestige. The right side of the map almost looks like a diagonal line slanting down from William Faulkner to P. G. Wodehouse.

That negative correlation doesn’t really characterize the field as a whole. Critical prestige still has a faint positive correlation with sales, as people over on the left side of the map might sadly remind us. But a brief survey of familiar names could give you the opposite impression.

In short, we’re not necessarily seeing a stronger stratification of the literary field. The change might better be described as a decline in the correlation of two existing forms of distinction. And as they become less correlated, the difference between them becomes more visible, especially among the well-known names on the right side of the map.

So, while we’re broadly confirming an existing story about literary history, the evidence also suggests that the metaphor of a “great divide” is a bit of an exaggeration. We don’t see any chasm emerging.

Maps of the literary field also help me understand why a classifier trained on an elite “reviewed” sample didn’t necessarily get stronger over time. The correlation of prestige and sales in the Victorian era means that the line separating the red and blue samples was strongly tilted there, and may borrow some of its strength from both axes. (It’s really a boundary between the prominent and the obscure.)

As we move into the twentieth century, the slope of the line gets flatter, and we get closer to a “pure” model of prestige (as distinguished from sales). But the boundary itself may not grow more clearly marked, if you’re sampling a group of the same size. (However, if you leave The New Republic and New Yorker behind, and sample only works reviewed in little magazines, you do get a more tightly unified group of texts that can be distinguished from a random sample with 83% accuracy.)

This is all great, you say — but how exactly are you “estimating” sales? We don’t actually have good sales figures for every author in HathiTrust Digital Library; we have fairly patchy records that depend on individual publishers.
[Figure: how the empirical Bayes estimate revises the raw evidence about sales.]
For the answer to that question, I’m going to refer you to the github repo where I work out a model of sales. The short version is that I borrow a version of “empirical Bayes” from Julia Silge and David Robinson, and apply it to evidence drawn from bestseller lists as well as digital libraries, to construct a rough estimate of each author’s relative prominence in the market. The trick is, basically, to use the evidence we have to construct an estimate of our uncertainty, and then use our uncertainty to revise the evidence. The picture on the left gives you a rough sense of how that transformation works. I think empirical Bayes may turn out to be useful for a lot of problems where historians need to reconstruct evidence that is patchy or missing in the historical record, but the details are too much to explain here; see Silge’s post and my Jupyter notebook.
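
Since the details live in the notebook, here is just a toy illustration of the underlying idea, borrowed from the beta-binomial examples in Silge and Robinson rather than from my own model: treat each author’s evidence as successes out of opportunities, fit a beta prior to the whole pool of raw proportions, and then shrink each author’s estimate toward that prior in proportion to how little evidence we have. All the numbers below are simulated.

```python
# A toy illustration of empirical Bayes shrinkage (beta-binomial), in the spirit
# of Silge and Robinson; not the actual sales model in the notebook.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated evidence: each "author" has some opportunities and some successes.
opportunities = rng.integers(2, 200, size=500)
true_rates = rng.beta(2, 20, size=500)
successes = rng.binomial(opportunities, true_rates)
raw = successes / opportunities                      # noisy for small samples

# Step 1: use all the evidence to estimate our uncertainty (fit a beta prior).
alpha0, beta0, _, _ = stats.beta.fit(np.clip(raw, 1e-4, 1 - 1e-4), floc=0, fscale=1)

# Step 2: use that uncertainty to revise each author's raw proportion.
shrunk = (successes + alpha0) / (opportunities + alpha0 + beta0)

few, many = opportunities.argmin(), opportunities.argmax()
print(f'prior mean: {alpha0 / (alpha0 + beta0):.3f}')
print(f'least evidence: raw {raw[few]:.3f} -> shrunk {shrunk[few]:.3f}')
print(f'most evidence:  raw {raw[many]:.3f} -> shrunk {shrunk[many]:.3f}')
```

Authors with little evidence get pulled most of the way toward the prior mean, while authors with lots of evidence barely move: that is the “use your uncertainty to revise the evidence” trick in miniature.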

Bubble charts invite mouse-over exploration. I can’t easily embed interactive viz in this blog, but here are a few links to plotly visualizations:

https://plot.ly/~TedUnderwood/8/the-literary-field-1850-1874/

https://plot.ly/~TedUnderwood/4/the-literary-field-1875-1899/

https://plot.ly/~TedUnderwood/2/the-literary-field-1900-1924/

https://plot.ly/~TedUnderwood/6/the-literary-field-1925-1949/

Acknowledgments

The texts used here are drawn from HathiTrust via the HathiTrust Research Center. Parts of the research were funded by the Andrew W. Mellon Foundation via the WCSA+DC grant, and parts by SSHRC via NovelTM.

Most importantly, I want to acknowledge my collaborators on this project, Kyle Johnston, Sabrina Lee, Jessica Mercado, and Jordan Sellers. They contributed a lot of intellectual depth to the project — for instance by doing research that helped us decide which periodicals should represent a given period of literary history.

References

Algee-Hewitt, Mark, and Mark McGurl. “Between Canon and Corpus: Six Perspectives on 20th-Century Novels.” Stanford Literary Lab, Pamphlet 8, 2015. https://litlab.stanford.edu/LiteraryLabPamphlet8.pdf

Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. “Canon/Archive: Large-Scale Dynamics in the Literary Field.” Stanford Literary Lab, Pamphlet 11, January 2016. https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf

Altick, Richard D. The English Common Reader: A Social History of the Mass Reading Public 1800-1900. Chicago: University of Chicago Press, 1957.

Bloom, Clive. Bestsellers: Popular Fiction Since 1900. 2nd edition. Houndmills: Palgrave Macmillan, 2008.

Hackett, Alice Payne, and James Henry Burke. 80 Years of Best Sellers 1895-1975. New York: R.R. Bowker, 1977.

Leavis, Q. D. Fiction and the Reading Public. 1935.

Mott, Frank Luther. Golden Multitudes: The Story of Bestsellers in the United States. New York: R. R. Bowker, 1947.

Robinson, David. Introduction to Empirical Bayes: Examples from Baseball Statistics. 2017. http://varianceexplained.org/r/empirical-bayes-book/

Silge, Julia. “Singing the Bayesian Beginner Blues.” data science ish, September 2016. http://juliasilge.com/blog/Bayesian-Blues/

Unsworth, John. 20th Century American Bestsellers. (http://bestsellers.lib.virginia.edu)