Category: 19c

The differentiation of literary and nonliterary diction, 1700-1900.

Post author By tedunderwood
Post date February 26, 2012
29 Comments on The differentiation of literary and nonliterary diction, 1700-1900.

When you stumble on an interesting problem, the question arises: do you blog the problem itself — or wait until you have a full solution to publish as an article?

In this case, I think the problem is too big to be solved by a single person anyway, so I might as well get it out there where we can all chip away at it. At the end of this post, I include a link to a page where you can also download the data and code I’m using.

When we compare groups of texts, we’re often interested in characterizing the contrast between them. But instead of characterizing the contrast, you could also just measure the distance between categories. For instance, you could generate a list of word frequencies for two genres, and then run a Spearman’s correlation test, to measure the rank-order similarity of their diction.

In isolation, a measure of similarity between two genres is hard to interpret. But if you run the test repeatedly to compare genres at different points in time, the changes can tell you when the diction of the genres becomes more or less similar.

NewverseToNon — Spearman similarity to nonfiction, measured at 5-year intervals. At each interval, a 39-year chunk of the collection (19 years on either side of the midpoint) is being selected for comparison.

In the graph above, I’ve done that with four genres, in a collection of 3,724 eighteenth- and nineteenth-century volumes (constructed in part by TCP and in part by Jordan Sellers — see acknowledgments), using the 10,000 most frequent words in the collection, excluding proper nouns. The black line at the top is flat, because nonfiction is always similar to itself. But the other lines decline as poetry, drama, and fiction become progressively less similar to nonfiction where word choice is concerned. Unsurprisingly, prose fiction is always more similar to nonfiction than poetry is. But the steady decline in the similarity of all three genres to nonfiction is interesting. Literary histories of this period have tended to pivot on William Wordsworth’s rebellion against a specialized “poetic diction” — a story that would seem to suggest that the diction of 19c poetry should be less different from prose than 18c poetry had been. But that’s not the pattern we’re seeing here: instead it appears that a differentiation was setting in between literary and nonliterary language.

This should be described as a differentiation of “diction” rather than style. To separate style from content (for instance to determine authorship) you need to focus on the frequencies of common words. But when critics discuss “diction,” they’re equally interested, I think, in common and less common words — and that’s the kind of measure of similarity that Spearman’s correlation will give you (Kilgarriff 2001).

The graph above makes it look as though nonfiction was remaining constant while other genres drifted away from it. But we are after all graphing a comparison with two sides. This raises the question: were poetry, fiction, and drama changing relative to nonfiction, or was nonfiction changing relative to them? But of course the answer is “both.”

DistFromInit — At each 5-year interval, the Spearman similarity is being measured between the 40-year span surrounding that point and the period 1700-1740.

Here we’re comparing each genre to its own past. The language of nonfiction changes somewhat more rapidly than the language of the other genres, but none of them remain constant. There is no fixed reference point in this world, which is why I’m talking about the “differentiation” of two categories. But even granting that, we might want to pose another skeptical question: when literary genres become less like nonfiction, is that merely a sign of some instability in the definition of “nonfiction”? Did it happen mostly because, say, the nineteenth century started to publish on specialized scientific topics? We can address this question to some extent by selecting a more tightly defined subset of nonfiction as a reference point — say, biographies, letters, and orations.

NewverseToBiog — The Spearman similarities here happen to be generated on the top 5000 words rather than the top 10000, but I have tried both wordsets and it makes very little difference.

Even when we focus on this relatively stable category, we see significant differentiation. Two final skeptical questions need addressing before I try to explain what happened. First, I’ve been graphing results so far as solid lines, because our eyes can’t sort out individual data points for four different variables at once. But a numerically savvy reader will want to see some distributions and error bars before assessing the significance of these results. So here are yearly values for fiction. In some cases these are individual works of fiction, though when there are two or more works of fiction in a single year they have been summed and treated as a group. Each year of fiction is being compared against biographies, letters, and orations for 19 years on either side.

That’s a fairly persuasive trend. You may, however, notice that the Spearman similarities for individual years on this graph are about .1 lower than they were when we graphed fiction as a 39-year moving window. In principle Spearman similarity is independent of corpus size, but it can be affected by the diversity of a corpus. The similarity between two individual texts is generally going to be lower than the similarity between two large and diverse corpora. So could the changes we’ve seen be produced by changes in corpus size? There could be some effect, but I don’t think it’s large enough to explain the phenomenon. [See update at the bottom of this post. The results are in fact even clearer when you keep corpus size constant. -Ed.] The sizes of the corpora for different genres don’t change in a way that would produce the observed decreases in similarity; the fiction corpus, in particular, gets larger as it gets less like nonfiction. Meanwhile, it is at the same time becoming more like poetry. We’re dealing with some factor beyond corpus size.

So how then do we explain the differentiation of literary and nonliterary diction? As I started by saying, I don’t expect to provide a complete answer: I’m raising a question. But I can offer a few initial leads. In some ways it’s not surprising that novels would gradually become less like biographies and letters. The novel began very much as faked biography and faked correspondence. Over the course of the period 1700-1900 the novel developed a sharper generic identity, and one might expect it to develop a distinct diction. But the fact that poetry and drama seem to have experienced a similar shift (together with the fact that literary genres don’t seem to have diverged significantly from each other) begins to suggest that we’re looking at the emergence of a distinctively “literary” diction in this period.

To investigate the character of that diction, we need to compare the vocabulary of genres at many different points. If we just compared late-nineteenth-century fiction to late-nineteenth-century nonfiction, we would get the vocabulary that characterized fiction at that moment, but we wouldn’t know which aspects of it were really new. I’ve done that on the side here, using the Mann-Whitney rho test I described in an earlier post. As you’ll see, the words that distinguish fiction from nonfiction from 1850 to 1900 are essentially a list of pronouns and verbs used to describe personal interaction. But that is true to some extent about fiction in any period. We want to know what aspects of diction had changed.

In other words, we want to find the words that became overrepresented in fiction as fiction was becoming less like nonfiction prose. To find them, I compared fiction to nonfiction at five-year intervals between 1720 and 1880. At each interval I selected a 39-year slice of the collection and ranked words according to the extent to which they were consistently more prominent in fiction than nonfiction (using Mann-Whitney rho). After moving through the whole timeline you end up with a curve for each word that plots the degree to which it is over or under-represented in fiction over time. Then you sort the words to find ones that tend to become more common in fiction as the whole genre becomes less like nonfiction. (Technically, you’re looking for an inverse Pearson’s correlation, over time, between the Mann-Whitney rho for this word and the Spearman’s similarity between genres.) Here’s a list of the top 60 words you find when you do that:

It’s not hard to see that there are a lot of words for emotional conflict here (“horror, courage, confused, eager, anxious, despair, sorrow, dread, agony”). But I would say that emotion is just one aspect of a more general emphasis on subjectivity, ranging from verbs of perception (“listen, listened, watched, seemed, feel, felt”) to explicitly psychological vocabulary (“nerves, mind, unconscious, image, perception”) to questions about the accuracy of perception (“dream, real, sight, blind, forget, forgot, mystery, mistake”). To be sure, there are other kinds of words in the list (“cottage, boy, carriage”). But since we’re looking at a change across a period of 200 years, I’m actually rather stunned by the thematic coherence of the list. For good measure, here are words that became relatively less common in fiction (or more common in nonfiction — that’s the meaning of “relatively”) as the two genres differentiated:

Looking at that list, I’m willing to venture out on a limb and suggest that fiction was specializing in subjectivity while nonfiction was tending to view the world from an increasingly social perspective (“executive, population, colonists, department, european, colonists, settlers, number, individuals, average.”)

Now, I don’t pretend to have solved this whole problem. First of all, the lists I just presented are based on fiction; I haven’t yet assessed whether there’s really a shared “literary diction” that unites fiction with poetry and drama. Jordan and I probably need to build up our collection a bit before we’ll know. Also, the technique I just used to select lists of words looks for correlations across the whole period 1700-1900, so it’s going to select words that have a relatively continuous pattern of change throughout this period. But it’s also entirely possible that “the differentiation of literary and nonliterary diction” was a phenomenon composed of several different, overlapping changes with a smaller “wavelength” on the time axis. So I would say that there’s lots of room here for alternate/additional explanations.

But really, this is a question that does need explanation. Literary scholars may hate the idea of “counting words,” but arguments about a distinctively “literary” language have been central to literary criticism from John Dryden to the Russian Formalists. If we can historicize that phenomenon — if we can show that a systematic distinction between literary and nonliterary language emerged at a particular moment for particular reasons — it’s a result that ought to have significance even for literary scholars who don’t consider themselves digital humanists.

By the way, I think I do know why the results I’m presenting here don’t line up with our received impression that “poetic diction” is an eighteenth-century phenomenon that fades in the 19c. There is a two-part answer. For one thing, part of what we perceive as poetic diction in the 18c is orthography (“o’er”, “silv’ry”). In this collection, I have deliberately normalized orthography, so “silv’ry” is treated as equivalent to “silvery,” and that aspect of “poetic diction” is factored out.

But we may also miss differentiation because we wrongly assume that plain or vivid language cannot be itself a form of specialization. Poetic diction probably did become more accessible in the 19c than it had been in the 18c. But this isn’t the same thing as saying that it became less specialized! A self-consciously plain or restricted diction still counts as a mode of specialization relative to other written genres. More on this in a week or two …

Finally, let me acknowledge that the work I’m doing here is built on a collaborative foundation. Laura Mandell helped me obtain the TCP-ECCO volumes before they were public, and Jordan Sellers selected most of the nineteenth-century collection on which this work is based — something over 1,600 volumes. While Jordan and I were building this collection, we were also in conversation with Loretta Auvil, Boris Capitanu, Tanya Clement, Ryan Heuser, Matt Jockers, Long Le-Khac, Ben Schmidt, and John Unsworth, and were learning from them how to do this whole “text mining” thing. The R/MySQL infrastructure for this is pretty directly modeled on Ben’s. Also, since the work was built on a collaborative foundation, I’m going to try to give back by sharing links to my data and code on this “Open Data” page.

References
Adam Kilgarriff, “Comparing Corpora,” International Journal of Corpus Linguistics 6.1 (2001): 97-133.

[UPDATE Monday Feb 27th, 7 pm: After reading Ben Schmidt’s comment below, I realized that I really had to normalize corpus size. “Probably not a problem” wasn’t going to cut it. So I wrote a script that samples a million-word corpus for each genre every two years. As long as I was addressing that problem, I figured I would address another one that had been nagging at my conscience. I really ought to be comparing a different wordlist each time I run the comparison. It ought to be the top 5000 words in each pair of corpora that get compared — not the top 5000 words in the collection as a whole.

The first time I ran the improved version I got a cloud of meaningless dots, and for a moment I thought my whole hypothesis about genre had been produced by a ‘loose optical cable.’ Not a good moment. But it was a simple error, and once I fixed it I got results that were actually much clearer than my earlier graphs.

I suppose you could argue that, since document size varies across time, it’s better to select corpora that have a fixed number of documents rather than a fixed word size. I ran the script that way too, and it produces results that are noisier but still unambiguous. The moral of the story is: it’s good to have blog readers who keep you honest and force you to clean up your methodology!]

18c 19c math methodology ngrams

Exploring the relationship between topics and trends.

Post author By tedunderwood
Post date November 18, 2011
1 Comment on Exploring the relationship between topics and trends.

I’ve been talking about correlation since I started this blog. Actually, that was the reason why I did start it: I think literary scholars can get a huge amount of heuristic leverage out of the fact that thematically and socially-related words tend to rise and fall together. It’s a simple observation, and one that stares you in the face as soon as you start to graph word frequencies on the time axis.¹ But it happens to be useful for literary historians, because it tends to uncover topics that also pose periodizable kinds of puzzles. Sometimes the puzzle takes the form of a topic we intuitively recognize (say, the concept of “color”) that increases or decreases in prominence for reasons that remain to be explained:

At other times, the connection between elements of the topic is not immediately intuitive, but the terms are related closely enough that their correlation suggests a pattern worthy of further exploration. The relationship between terms may be broadly historical:

Or it may involve a pattern of expression that characterizes a periodizable style:

Of course, as the semantic relationship between terms becomes less intuitively obvious, scholars are going to wonder whether they’re looking at a real connection or merely an accidental correlation. “Ardent” and “tranquil” seem like opposites; can they really be related as elements of a single discourse? And what’s the relationship to “bosom,” anyway?

Ultimately, questions like this have to be addressed on a case-by-case basis; the significance of the lead has to be fleshed out both with further analysis, and with close reading.

But scholars who are wondering about the heuristic value of correlation may be reassured to know that this sort of lead does generally tend to pan out. Words that correlate with each other across the time axis do in practice tend to appear in the same kinds of volumes. For instance, if you randomly select pairs of words from the top 10,000 words in the Google English ngrams dataset 1700-1849,² measure their correlation with each other in that dataset across the period 1700-1849, and then measure their tendency to appear in the same volumes in a different collection³ (taking the cosine similarity of term vectors in a term-document matrix), the different measures of association correlate with each other strongly. (Pearson’s r is 0.265, significant at p < 0.0005.) Moreover, the relationship holds (less strongly, but still significantly) even in adjacent centuries: words that appear in the same eighteenth-century volumes still tend to rise and fall together in the nineteenth century.

Why should humanists care about the statistical relationship between two measures of association? It means that correlation-mining is in general going to be a useful way of identifying periodizable discourses. If you find a group of words that correlate with each other strongly, and that seem related at first glance, it's probably going to be worthwhile to follow up the hunch. You’re probably looking at a discourse that is bound together both diachronically (in the sense that the terms rise and fall together) and topically (in the sense that they tend to appear in the same kinds of volumes).

Ultimately, literary historians are going to want to assess correlation within different genres; a dataset like Google's, which mixes all genres in a single pool, is not going to be an ideal tool. However, this is also a domain where size matters, and in that respect, at the moment, the ngrams dataset is very helpful. It becomes even more helpful if you correct some of the errors that vitiate it in the period before 1820. A team of researchers at Illinois and Stanford⁴, supported by the Andrew W. Mellon Foundation, has been doing that over the course of the last year, and we're now able to make an early version of the tool available on the web. Right now, this ngram viewer only covers the period 1700-1899, but we hope it will be useful for researchers in that period, because it has mostly corrected the long-s problem that confufes opt1cal charader readers in the 18c — as well as a host of other, less notorious problems. Moreover, it allows researchers to mine correlations in the top 10,000 words of the lexicon, instead of trying words one by one to see whether an interesting pattern emerges. In the near future, we hope to expand the correlation miner to cover the twentieth century as well.

For further discussion of the statistical relationship between topics and trends, see this paper submitted to DHCS 2011.

UPDATE Nov 22, 2011: At DHCS 2011, Travis Brown pointed out to me that Topics Over Time (Wang and McCallum) might mine very similar patterns in a more elegant, generative way. I hope to find a way to test that method, and may perhaps try to build an implementation for it myself.

References
1) Ryan Heuser and I both noticed this pattern last winter. Ryan and Long Le-Khac presented on a related topic at DH2011: Heuser, Ryan, and Le-Khac, Long. “Abstract Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field,” Digital Humanities 2011, Stanford University.

2) Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (Published online ahead of print: 12/16/2010)

3) The collection of 3134 documents (1700-1849) I used for this calculation was produced by combining ECCO-TCP volumes with nineteenth-century volumes selected and digitized by Jordan Sellers.

4) The SEASR Correlation Analysis and Ngrams Viewer was developed by Loretta Auvil and Boris Capitanu at the Illinois Informatics Institute, modeled on prototypes built by Ted Underwood, University of Illinois, and Ryan Heuser, Stanford.

18c 19c collection-building

The challenges of digital work on early-19c collections.

Post author By tedunderwood
Post date October 7, 2011
7 Comments on The challenges of digital work on early-19c collections.

I’ve been posting mostly about collections built by other people (TCP-ECCO and Google). But I’m also in the process of building a small (thousand-title) 19c collection myself, in collaboration with E. Jordan Sellers. Jordan is selecting titles for the collection; I’m writing the Python scripts that process the texts. This is a modest project intended to support research for a few years, not a model for long-term curatorial practice. But we’ve encountered a few problems specific to the early 19c, and I thought I might share some of our experience and tools in case they’re useful for other early-19c scholars.

Smellie2 — Literary and Characteristical Lives (1800), by William and Alexander Smellie. Note esp. the ligatures in 'first' and 'section.'

I originally wanted to create a larger collection, containing twenty or thirty thousand volumes, on the model of Ben Schmidt’s impressive work with nineteenth-century volumes vacuumed up from the Open Library. But because I needed a collection that bridged the eighteenth and nineteenth centuries, I found I had to proceed more slowly. The eighteenth century itself wasn’t the problem. Before 1800, archaic typography makes most optical character recognition unreliable — but for that very reason, TCP-ECCO has been producing clean, manually-keyed versions of 18c texts, enough at least for a small collection. The later 19c also isn’t a problem, because after 1830 or so, OCR quality is mostly adequate.

OCR2 — OCR version of Smellie, contributed by Columbia University Libraries to the Internet Archive.

But between 1800 and (say) 1830, you fall between two stools. It’s technically the nineteenth century, so people assume that OCR ought to work. But in practice, volumes from this period still have a lot of eighteenth-century typographical quirks, including loopy ligatures, the notorious “long s,” and worn or broken type. So the OCR is often pretty vile. I’m willing to put up with background noise if it’s evenly distributed. But these errors are distributed unevenly across the lexicon and across time, so they could actually distort conclusions if left unaddressed.

I decided to build a Python script to do post-processing correction of OCR. There are a lot of ways to do this; my approach was modeled on a paper written by Thomas A. Lasko and Susan E. Hauser for the National Library of Medicine. Briefly, what they show is that OCR correction becomes much more reliable when the program is given statistical information about the language, and errors, to be expected in a given domain. They’re working with contemporary text, but the principle holds even more strongly when you’re working in a different historical period. A generic spellchecker won’t perform well with texts that contain period spellings (“despatch,” “o’erflow’d”), systematic f/s substitution, and a much higher proportion of Latin and French than we’re used to. If your system corrects every occurrence of “même” to “mime,” you’re going to end up with a surprising number of mimes; if you accept “foul” at face value as a correctly-spelled word, you’re going to have very little “soul” in your collection.

Briefly, I customized my spellchecker for the early 19c in three ways:

A few other tricks are needed to optimize speed, and to make sure the script doesn’t over-correct proper nouns; anyone who’s interested in doing this should drop me a line for a fuller description and a copy of the code.

Corrected2 — Automatically corrected version.

The results aren’t perfect, but they’re good enough to be usable (I am also recording the number of corrections and uncorrectable tokens so that I can assess margins of error later on).

I haven’t packaged this code yet for off-the-shelf use; it’s still got a few trailing wires. But if you want to cannibalize/adapt it, I’d be happy to give you a copy. Perhaps more importantly, I’d like to share a couple of sets of rules that might be helpful for anyone who’s attempting to normalize an 18/19c collection. Both of these rulesets are tab-delimited utf-8 .txt files. First, my list of 4600 rules for correcting 18/19c spellings, including syncopated past-tense forms like “bury’d” and “drop’d.” (Note that syncope cannot always be fixed simply by adding back an “e.” Rules for normalizing poetic syncope — “flow’ry,” “ta’en” — are clustered at the end of the file, so you can delete them if desired.) This ruleset has been transformed by a long series of joins and filtering operations, and edited manually, but I should acknowledge that part of the original list was borrowed from the source files that accompany WordHoard, developed at Northwestern University. I should also warn potential users that these rules are designed to normalize spelling to modern British practice.

The other thing it might be useful to share is a list of 2grams extracted from the Google English corpus, that I use for contextual spellchecking. This includes only 2grams where one of the two elements is a token like “fix” or “flip” that could be read either as a valid word or as an OCR error caused by the long s. Since the long s is also a problem in the Google dataset itself up to 1820, this list was based on frequencies from 1825-50. That’s not perfect for correcting texts in the 1800-1820 period, but I find that in practice it’s adequate. There are two columns here: the 2gram itself, and the frequency.

18c 19c math ngrams

Words that appear in the same 18c volumes also track each other over time, through the 19c.

Post author By tedunderwood
Post date September 19, 2011
No Comments on Words that appear in the same 18c volumes also track each other over time, through the 19c.

I wrote a long post last Friday arguing that topic-modeling an 18c collection is a reliable way of discovering eighteenth- and nineteenth-century trends, even in a different collection.

But when I woke up on Saturday I realized that this result probably didn’t depend on anything special about “topic modeling.” After all, the topic-modeling process I was using was merely a form of clustering. And all the clustering process does is locate the hypothetical centers of more-or-less-evenly-sized clouds of words in vector space. It seemed likely that the specific locations of these centers didn’t matter. The key was the underlying measure of association between words — “cosine similarity in vector space,” which is a fancy way of saying “tendency to be common in the same 18c volumes.” Any group of words that were common (and uncommon) in the same 18c volumes would probably tend to track each other over time, even into the next century.

GratifyCurves540 — Six words that tend to occur in the same 18c volumes as 'gratify' (in TCP-ECCO), plotted over time in a different collection (a corrected version of Google ngrams).

To test this I wrote a script that chose 200 words at random from the top 5000 in a collection of 2,193 18c volumes (drawn from TCP-ECCO with help from Laura Mandell), and then created a cluster around each word by choosing the 25 words most likely to appear in the same volumes (cosine similarity). Would pairs of words drawn from these randomly distributed clusters also show a tendency to correlate with each other over time?

They absolutely do. The Fisher weighted mean pairwise r for all possible pairs drawn from the same cluster is .267 in the 18c and .284 in the 19c (the 19c results are probably better because Google’s dataset is better in the 19c even after my efforts to clean the 18c up*). At n = 100 (measured over a century), both correlations have rock-solid statistical significance, p < .01. And in case you're wondering … yes, I wrote another script to test randomly selected words using the same statistical procedure, and the mean pairwise r for randomly selected pairs (factoring out, as usual, partial correlation with the larger 5000-word group they’re selected from) is .0008. So I feel confident that the error level here is low.**

What does this mean, concretely? It means that the universe of word-frequency data is not chaotic. Words that appear in the same discursive contexts tend, strongly, to track each other over time, and (although I haven’t tested this rigorously yet), it’s not hard to see that the converse proposition is also going to hold true: words that track each other over time are going to tend to have contextual associations as well.

To put it even more bluntly: pick a word, any word! We are likely to be able to define not just a topic associated with that word, but a discourse — a group of words that are contextually related and that also tend to wax and wane together over time. I don’t imagine that in doing so we prove anything of humanistic significance, but I do think it means that we can raise several thousand significant questions. To start with: what was the deal with eagerness, gratification, and disappointment in the second half of the eighteenth century?

* A better version of Google’s 18c dataset may be forthcoming from the NCSA.

** For people who care about the statistical reliability of data-mining, here’s the real kicker: if you run a Benjamini-Hochberg procedure on these 200 randomly-generated clusters, 134 of them have significance at p < .05 in the 19c even after controlling for the false discovery rate. To put that more intelligibly, these are guaranteed not to be xkcd’s green jelly beans. The coherence of these clusters is even greater than the ones produced by topic-modeling, but that’s probably because they are on average slightly smaller (25 words); I have yet to test the relative importance of different generation procedures while holding cluster size rigorously constant.

18c 19c math methodology topic modeling trend mining

Topics tend to be trends. Really: p < .05!

Post author By tedunderwood
Post date September 16, 2011
No Comments on Topics tend to be trends. Really: p < .05!

While I’m fascinated by cases where the frequencies of two, or ten, or twenty words closely parallel each other, my conscience has also been haunted by a problem with trend-mining — which is that it always works. There are so many words in the English language that you’re guaranteed to find groups of them that correlate, just as you’re guaranteed to find constellations in the night sky. Statisticians call this the problem of “multiple comparisons”; it rests on a fallacy that’s nicely elucidated in this classic xkcd comic about jelly beans.

Simply put: it feels great to find two conceptually related words that correlate over time. But we don’t know whether this is a significant find, unless we also know how many potentially related words don’t correlate.

One way to address this problem is to separate the process of forming hypotheses from the process of testing them. For instance, we could use topic modeling to divide the lexicon up into groups of terms that occur in the same contexts, and then predict that those terms will also correlate with each other over time. In making that prediction, we turn an undefined universe of possible comparisons into a finite set.

Once you create a set of topics, plotting their frequencies is simple enough. But plotting the aggregate frequency of a group of words isn’t the same thing as “discovering a trend,” unless the individual words in the group actually correlate with each other over time. And it’s not self-evident that they will.

Silence2 — The top 15 words in topic #91, "Silence/Listened," and their cosine similarity to the centroid.

So I decided to test the hypothesis that they would. I used semi-fuzzy clustering to divide one 18c collection (TCP-ECCO) into 200 groups of words that tend to appear in the same volumes, and then tested the coherence of those topics over time in a different 18c collection (a much-cleaned-up version of the Google ngrams dataset I produced in collaboration with Loretta Auvil and Boris Capitanu at the NCSA). Testing hypotheses in a different dataset than the one that generated them is a way of ensuring that we aren’t simply rediscovering the same statistical accidents a second time.

To make a long story short, it turns out that topics have a statistically significant tendency to be trends (at least when you’re working with a century-sized domain). Pairs of words selected from the same topic correlated significantly with each other even after factoring out other sources of correlation*; the Fisher weighted mean r for all possible pairs was 0.223, which measured over a century (n = 100) is significant at p < .05.

In practice, the coherence of different topics varied widely. And of course, any time you test a bunch of hypotheses in a row you're going to get some false positives. So the better way to assess significance is to control for the "false discovery rate." When I did that (using the Benjamini-Hochberg method) I found that 77 out of the 200 topics cohered significantly as trends.

There are a lot of technical details, but I'll defer them to a footnote at the end of this post. What I want to emphasize first is the practical significance of the result for two different kinds of researchers. If you're interested in mining diachronic trends, then it may be useful to know that topic-modeling is a reliable way of discovering trends that have real statistical significance and aren’t just xkcd’s “green jelly beans.”

The top 15 terms in topic #89, "Enemy/Attacked," and their cosine similarity to the centroid.

Conversely, if you're interested in topic modeling, it may be useful to know that the topics you generate will often be bound together by correlation over time as well. (In fact, as I’ll suggest in a moment, topics are likely to cohere as trends beyond the temporal boundaries of your collection!)

Finally, I think this result may help explain a phenomenon that Ryan Heuser, Long Le-Khac, and I have all independently noticed: which is that groups of words that correlate over time in a given collection also tend to be semantically related. I've shown above that topic modeling tends to produce diachronically coherent trends. I suspect that the converse proposition is also true: clusters of words linked by correlation over time will turn out to have a statistically significant tendency to appear in the same contexts.

Why are topics and trends so closely related? Well, of course, when you’re topic-modeling a century-long collection, co-occurrence has a diachronic dimension to start with. So the boundaries between topics may already be shaped by change over time. It would be interesting to factor time out of the topic-modeling process, in order to see whether rigorously synchronic topics would still generate diachronic trends.

I haven’t tested that yet, but I have tried another kind of test, to rule out the possibility that we’re simply rediscovering the same trends that generated the topics in the first place. Since the Google dataset is very large, you can also test whether 18c topics continue to cohere as trends in the nineteenth century. As it turns out, they do — and in fact, they cohere slightly more strongly! (In the 19c, 88 out of 200 18c topics cohered significantly as trends.) The improvement is probably a clue that Google’s dataset gets better in the nineteenth century (which god knows, it does) — but even if that’s true, the 19c result would be significant enough on its own to show that topic modeling has considerable predictive power.

Practically, it’s also important to remember that “trends” can play out on a whole range of different temporal scales.

For instance, here’s the trend curve for topic #91, “Silence / Listened,” which is linked to the literature of suspense, and increases rather gradually and steadily from 1700 to the middle of the nineteenth century.

By contrast, here’s the trend curve for topic #89, “Enemy/Attacked,” which is largely used in describing warfare. It doesn’t change frequency markedly from beginning to end; instead it bounces around from decade to decade with a lot of wild outliers. But it is in practice a very tightly-knit trend: a pair of words selected from this topic will have on average 31% of their variance in common. The peaks and outliers are not random noise: they’re echoes of specific armed conflicts.

* Technical details: Instead of using Latent Dirichlet Allocation for topic modeling, I used semi-fuzzy c-means clustering on term vectors, where term vectors are defined in the way I describe in this technical note. I know LDA is the standard technique, and it seems possible that it would perform even better than my clustering algorithm does. But in a sufficiently large collection of documents, I find that a clustering algorithm produces, in practice, very coherent topics, and it has some other advantages that appeal to me. The “semi-fuzzy” character of the algorithm allows terms to belong to more than one cluster, and I use cosine similarity to the centroid to define each term’s “degree of membership” in a topic.

I only topic-modeled the top 5000 words in the TCP-ECCO collection. So in measuring pairwise correlations of terms drawn from the same topic, I had to calculate it as a partial correlation, controlling for the fact that terms drawn from the top 5k of the lexicon are all going to have, on average, a slight correlation with each other simply by virtue of being drawn from that larger group.

18c 19c fiction methodology

For most literary scholars, text mining is going to be an exploratory tool.

Post author By tedunderwood
Post date August 15, 2011
4 Comments on For most literary scholars, text mining is going to be an exploratory tool.

Having just returned from a conference of Romanticists, I’m in a mood to reflect a bit about the relationship between text mining and the broader discipline of literary studies. This entry will be longer than my usual blog post, because I think I’ve got an argument to make that demands a substantial literary example. But you can skip over the example to extract the polemical thesis if you like!

At the conference, I argued that literary critics already practice a crude form of text mining, because we lean heavily on keyword search when we’re tracing the history of a topic or discourse. I suggested that information science can now offer us a wider range of tools for mapping archives — tools that are subtler, more consonant with our historicism, and maybe even more literary than keyword search is.

At the same time, I understand the skepticism that many literary critics feel. Proving a literary thesis with statistical analysis is often like cracking a nut with a jackhammer. You can do it: but the results are not necessarily better than you would get by hand.

One obvious solution would be to use text mining in an exploratory way, to map archives and reveal patterns that a critic could then interpret using nuanced close reading. I’m finding that approach valuable in my own critical practice, and I’d like to share an example of how it works. But I also want to reflect about the social forces that stand in the way of this obvious compromise between digital and mainstream humanists — leading both sides to assume that quantitative analysis ought to contribute instead by proving literary theses with increased certainty.

Annote — Part of a topic tree based on a generically diverse collection of 2200 18c texts.

I’ll start with an example. If you don’t believe text mining can lead to literary insights, bear with me: this post starts with some goofy-looking graphs, but develops into an actual hypothesis about the Romantic novel based on normal kinds of literary evidence. But if you’re willing to take my word that text-mining can produce literary leads, or simply aren’t interested in Romantic-era fiction, feel free to skip to the end of this (admittedly long!) post for the generalizations about method.

Several months ago, when I used hierarchical clustering to map eighteenth-century diction on this blog, I pointed to a small section of the resulting tree that intriguingly mixed language about feeling with language about time. It turned out that the words in this section of the tree were represented strongly in late-eighteenth-century novels (novels, for instance, by Frances Burney, Sophia Lee, and Ann Radcliffe). Other sections of the tree, associated with poetry or drama, had a more vivid kind of emotive language, and I wondered why novels would combine an emphasis on feeling or exclamation (“felt,” “cried”) with the abstract topic of duration (“moment,” “longer”). It seemed an oddly phenomenological way to think about emotion.

But I also realized that hierarchical clustering is a fairly crude way of mapping conceptual space in an archive. The preferred approach in digital humanities right now is topic modeling, which does elegantly handle problems like polysemy. However, I’m not convinced that existing methods of topic modeling (LDA and so on) are flexible enough to use for exploration. One of their chief advantages is that they don’t require the human user to make judgment calls: they automatically draw boundaries around discrete “topics.” But for exploratory purposes boundaries are not an advantage! In exploring an archive, the goal is not to eliminate ambiguity so that judgment calls are unnecessary: the goal is to reveal intriguing ambiguities, so that the human user can make judgments about them.

If this is our goal, it’s probably better to map diction as an associative web. Fortunately, it was easy to get from the tree to a web, because the original tree had been based on an algorithm that measured the strength of association between any two words in the collection. Using the same algorithm, I created a list of twenty-five words most strongly associated with the branch that had interested me (“instantly,” “cried,” “felt,” moment,” “longer,”) and then used the strengths of association between those words to model the whole list as a force-directed graph. In this graph, words are connected by “springs” that pull them closer together; the darker the line, the stronger the association between the two words, and the more tightly they will be bound together in the graph. (The sizes of words are loosely proportional to their frequency in the collection, but only very loosely.)

A graph like this is not meant to be definitive: it’s a brainstorming tool that helps me explore associations in a particular collection (here, a generically diverse collection of eighteenth-century writing). On the left side, we see a triangle of feminine pronouns (which are strongly represented in the same novels where “felt,” “moment,” and so on are strongly represented) as well as language that defines domestic space (“quitting,” “room”). On the right side of the graph, we see a range of different kinds of emotion. And yet, looking at the graph as a whole, there is a clear emphasis on an intersection of feeling and time — whether the time at issue is prospective (“eagerly,” “hastily,” “waiting”) or retrospective (“recollected,” “regret”).

In particular, there are a lot of words here that emphasize temporal immediacy, either by naming a small division of time (“moment,” “instantly”), or by defining a kind of immediate emotional response (“surprise,” “shocked,” “involuntarily”). I have highlighted some of these words in red; the decision about which words to include in the group was entirely a human judgment call — which means that it is open to the same kind of debate as any other critical judgment.

But the group of words I have highlighted in red — let’s call it a discourse of temporal immediacy — does turn out to have an interesting historical profile. We already know that this discourse was common in late-eighteenth-century novels. But we can learn more about its provenance by restricting the generic scope of the collection (to fiction) and expanding its temporal scope to include the nineteenth as well as eighteenth centuries. Here I’ve graphed the aggregate frequency of this group of words in a collection of 538 works of eighteenth- and nineteenth-century fiction, plotted both as individual works and as a moving average. [The moving average won’t necessarily draw a line through the center of the “cloud,” because these works vary greatly in size. For instance, the collection includes about thirty of Hannah More’s “Cheap Repository Tracts,” which are individually quite short, and don’t affect the moving average more than a couple of average-sized novels would, although they create an impressive stack of little circles in the 1790s.]

The shape of the curve here suggests that we’re looking at a discourse that increased steadily in prominence through the eighteenth century and peaked (in fiction) around the year 1800, before sinking back to a level that was still roughly twice its early-eighteenth-century frequency.

Why might this have happened? It’s always a good idea to start by testing the most boring hypothesis — so a first guess might be that words like “moment” and “instantly” were merely displacing some set of close synonyms. But in fact most of the imaginable synonyms for this set of words correlate very closely with them. (This is true, for instance, of words like “sudden,” “abruptly,” and “alarm.”)

Another way to understand what’s going on would be to look at the works where this discourse was most prominent. We might start by focusing on the peak between 1780 and 1820. In this period, the works of fiction where language of temporal immediacy is most prominent include

Zofloya

Hermione, or the Orphan Sisters

The Monk

A Sicilian Romance

The Castles of Athlin and Dunbayne

The Romance of the Forest

Cecilia

The Wanderer

Adeline Mowbray

The Recess

There is a strong emphasis here on the Gothic, but perhaps also, more generally, on women writers. The works of fiction where the same discourse is least prominent would include

Coelebs in Search of a Wife

Hermsprong

Life; or the Adventures of William Ramble

Castle Rackrent

The Children’s Friend

Vaurien; or, Sketches of the Times

Many of these works are deliberately old-fashioned in their approach to narrative form: they are moral parables, or stories for children, or first-person retrospective narratives (like Rackrent), or are told by Fieldingesque narrators who feel free to comment and summarize extensively (as in the works by Disraeli and Trusler).

After looking closely at the way the language of temporal immediacy is used in Frances Burney, Cecilia (1782), and Sophia Lee, The Recess (1785), it seems to me that it had both a formal and an affective purpose.

Formally, it foregrounded a newly sharp kind of temporal framing. If we believe Ian Watt, part of the point of the novel form is to emulate the immediacy of first-hand experience — a purpose that can be slightly at odds with the retrospective character of narrative. Eighteenth-century novelists fought the distancing effect of retrospection in a lot of ways: epistolary narrative, discovered journals and so on are ways of bringing the narrative voice as close as possible to the moment of experience. But those tricks have limits; at some point, if your heroine keeps running off to write breathless letters between every incident, Henry Fielding is going to parody you.

By the late eighteenth century it seems to me novelists were starting to work out ways of combining temporal immediacy with ordinary retrospective narration. Maybe you can’t literally have your narrator describe events as they’re taking place, but you can describe events in a way that highlights their temporal immediacy. This is one of the things that makes Frances Burney read more like a nineteenth-century novelist than like Defoe; she creates a tight temporal frame for each event, and keeps reminding her readers about the tightness of the frame. So, a new paragraph will begin “A few moments after he was gone …” or “At that moment Sir Robert himself burst into the Room …” or “Cecilia protested she would go instantly to Mr Briggs,” to choose a few examples from a single chapter of Cecilia (my italics, 363-71). We might describe this vaguely as a way of heightening suspense — but there are of course many different ways to produce suspense in fiction. Narratology comes closer to the question at issue when it talks about “pacing,” but unless someone has already coined a better term, I think I would prefer to describe this late-18c innovation as a kind of “temporal framing,” because the point is not just that Burney uses “scene” rather than “summary” to make discourse time approximate story time — but that she explicitly divides each “scene” into a succession of discrete moments.

There is a lot more that could be said about this aspect of narrative form. For one thing, in the Romantic era it seems related to a particular way of thinking about emotion — a strategy that heightens emotional intensity by describing experience as if it were divided into a series of instananeous impressions. E.g, “In the cruelest anxiety and trepidation, Cecilia then counted every moment till Delvile came …” (Cecilia, 613). Characters in Gothic fiction are “every moment expecting” some start, shock, or astonishment. “The impression of the moment” is a favorite phrase for both Burney and Sophia Lee. On one page of The Recess, a character “resign[s] himself to the impression of the moment,” although he is surrounded by a “scene, which every following moment threatened to make fatal” (188, my italics).

In short, fairly simple tools for mapping associations between words can turn up clues that point to significant formal, as well as thematic, patterns. Maybe I’m wrong about the historical significance of those patterns, but I’m pretty sure they’re worth arguing about in any case, and I would never have stumbled on them without text mining.

On the other hand, when I develop these clues into a published article, the final argument is likely to be based largely on narratology and on close readings of individual texts, supplemented perhaps by a few simple graphs of the kind I’ve provided above. I suppose I could master cutting-edge natural language processing, in order to build a fabulous tool that would actually measure narrative pace, and the division of scenes into incidents. That would be fun, because I love coding, and it would be impressive, since it would prove that digital techniques can produce literary evidence. But the thing is, I already have an open-source application that can measure those aspects of narrative form, and it runs on inexpensive hardware that requires only water, glucose, and caffeine.

The methodological point I want to make here is that relatively simple forms of text mining, based on word counts, may turn out to be the techniques that are in practice most useful for literary critics. Moreover, if I can speak frankly: what makes this fact hard for us to acknowledge is not technophilia per se, but the nature of the social division between digital humanists and mainstream humanists. Literary critics who want to dismiss text mining are fond of saying “when you get right down to it, it’s just counting words.” (At moments like this we seem to forget everything 20c literary theorists ever learned from linguistics, and go back to treating language as a medium that ideally, ought to be immaterial and transparent. Surely a crudely verbal approach — founded on lumpy, ambiguous words — can never tell us anything about the conceptual subtleties of form and theme!) Stung by that critique, digital humanists often feel we have to prove that our tools can directly characterize familiar literary categories, by doing complex analyses of syntax, form, and sentiment.

I don’t want to rule out those approaches; I’m not interested in playing the game “Computers can never do X.” They probably can do X. But we’re already carrying around blobs of wetware that are pretty good at understanding syntax and sentiment. Wetware is, on the other hand, terrible at counting several hundred thousand words in order to detect statistical clues. And clues matter. So I really want to urge humanists of all stripes to stop imagining that text mining has to prove its worth by proving literary theses.

That should not be our goal. Full-text search engines don’t perform literary analysis at all. Nor do they prove anything. But literary scholars find them indispensable: in fact, I would argue that search engines are at least partly responsible for the historicist turn in recent decades. If we take the same technology used in those engines (a term-document matrix plus vector space math), and just turn the matrix on its side so that it measures the strength of association between terms rather than documents, we will have a new tool that is equally valuable for literary historians. It won’t prove any thesis by itself, but it can uncover a whole new range of literary questions — and that, it seems to me, ought to be the goal of text mining.

References
Frances Burney, Cecilia; or, Memoirs of an Heiress, ed. Peter Sabor and Margaret Ann Doody (Oxford: OUP, 1999).
Sophia Lee, The Recess; or, a Tale of Other Times, ed. April Alliston (Lexington: UP of Kentucky, 2000).

[Postscript: This post, originally titled “How to make text mining serve literary history,” is a version of a talk I gave at NASSR 2011 in Park City, Utah, sharpened by the discussion that took place afterward. I’d like to thank the organizers of the conference (Andrew Franta and Nicholas Mason) as well as my co-panelists (Mark Algee-Hewitt and Mark Schoenfield) and everyone in the audience. The original slides are here; sometimes PowerPoint can be clearer than prose.

I don’t mean to deny, by the way, that the simple tools I’m using could be refined in many ways — e.g., they could include collocations. What I’m saying is I that don’t think we need to wait for technical refinements. Our text-mining tools are already sophisticated enough to produce valuable leads, and even after we make them more sophisticated, it will remain true that at some point in the critical process we have to get off the bicycle and walk.]

18c 19c visualization

The history of an association, part two.

Post author By tedunderwood
Post date May 6, 2011
No Comments on The history of an association, part two.

Here’s another attempt to animate the history of a cluster of associated words — this time as a force-directed graph that folds and unfolds itself as the window of time moves forward, and changing strengths of association create different tensions in the graph.

I had a lot of fun making this clip, but I don’t want to make exaggerated claims for it. These images might not mean very much to me if I hadn’t also read some of the books on which they’re based. The visualization only took a day to build, though, and I think it might turn out to be a useful brainstorming tool. In this instance the clip got me thinking about the different ways time is imagined in the “terror gothic” and in the “horror gothic.”

Association between words is measured here using a vector space model and a collection of more than five hundred works of British fiction. I realize it may seem strange that associations can form and disappear while an eighty-year search window moves forward only sixty years — at the end of this clip the cluster is disappearing while the window still overlaps with the period where the cluster started to emerge. It’s worth recalling that the model isn’t counting words, but measuring the association between them. An early-eighteenth-century work that didn’t use sentimental language at all would do nothing to dilute the association between sentimental terms. But a group of nineteenth-century works that used the same language differently could rapidly obscure earlier patterns.

In short, I suspect that the language of temporal immediacy (“moment,” “suddenly,” “immediately,” and so on) is strongly associated with feeling in the 18c in part because gothic novels, and novels of sensibility, just get to it first. In the nineteenth century other kinds of fiction may take up the same temporal language, diluting its specific connection to tremulous feeling. I can’t prove it yet, but the clues I’m seeing do point in that general direction.

18c 19c visualization

The history of an association.

Post author By tedunderwood
Post date May 4, 2011
No Comments on The history of an association.

[Update May 6th, 2011: The problem I describe here is solved a bit more effectively in a more recent post.] It’s fairly easy to visualize a cluster of associated words. But I’d also like to understand how these associations change, and visualizing that is trickier. For one thing, it’s not easy to define what it means to trace “the same” cluster across time; we need an approach that remains open to the possibility that a particular set of associations could simply weaken or dissolve. The video I’ve embedded below is a first, tentative stab at the problem. Move your mouse pointer away after clicking “play” to see the image without cropping.

I’m trying to understand a late-eighteenth-century convergence between the language of temporality and of feeling. Two words that seemed particularly strongly connected were “moment” and “felt.” So what I’ve done is to proceed five years at a time through a 200-year-long corpus, looking at 80-year-long windows from the corpus. In each “snapshot,” I select the twelve words that associate most strongly in vector space with a vector that’s composed of both “moment” and “felt.” In order to graph them on a coordinate plane, I also measure their association with each term separately. The y axis is association with “moment,” and the x axis is association with “felt.” The reference terms themselves are also plotted. This gives me a way to visualize strength of association in the whole cluster — basically, as everything gets closer to the upper-right-hand corner, the strength of association is getting stronger. At the same time we can get a general sense of the semantic character of the cluster.

I’m working with a relatively small collection here — 538 works of British fiction stretched out between 1700 and 1900. I have a larger 18th-century collection, but in this case I needed continuity over a longer span of time, and in order to achieve that I had to limit the collection to fiction, which reduced its size. It also means that the selection of words you’ll see here is different from the selection of words you saw in previous posts about the “felt-moment” convergence, which were based on a generically diverse collection.

Some of the things that are awkward about this video are consequences of the small collection size. For instance, given the small collection size, I have to choose a pretty long window (80 years out of an overall 200-year-long collection). The window is a bit shorter than that at the beginning of the video — for purely dramatic reasons, so that we don’t reach the “climax” of the clip too rapidly.

Also, of course, the stop-motion animation is rather jerky. With a larger collection, I think it might actually be possible to watch these terms move across the coordinate plane in a smooth and connected fashion. But given the small collection size, smooth motion would be illusory; the data don’t really support that level of precision.

However, even with all those caveats, I feel I’m learning something from the exercise. I think we are glimpsing the transformation of an associative cluster, and looking at the way it changes across time makes me more than ever suspect that — at the moment when it’s strongest — it has something to do with the way late-eighteenth-century fiction imagines suspense. “Anxiety” and “agitation” are durable presences, often in the upper-right-hand corner of the cluster. This interpretation is also, of course, based on reading some of the relevant works, and I think the next stage in exploring the question will be to go back and read them again. As always, I’m inclined to present text-mining more as an exploratory tool or brainstorming technique than as definitive evidence.

It is also a bit interesting to watch the language of gothic agitation turn into language of middle-class striving as we get into the nineteenth century. The intersection between “moment” and “felt” is increasingly occupied not by trembling but by terms like “energy,” “effort,” and “struggle.” I’m not quite sure what to make of that trajectory. Perhaps it helps explain the dissolution of the earlier cluster.

Another way of visualizing clusters like this might be to group terms in a force-directed graph and animate the evolution of the graph across time.

18c 19c methodology ngrams topic modeling

Trends, topics, and trending topics.

Post author By tedunderwood
Post date April 19, 2011
3 Comments on Trends, topics, and trending topics.

I’ve developed a text-mining strategy that identifies what I call “trending topics” — with apologies to Twitter, where the term is used a little differently. These are diachronic patterns that I find practically useful as a literary historian, although they don’t fit very neatly into existing text-mining categories.

A “topic,” as the term is used in text-mining, is a group of words that occur together in a way that defines a thematic focus. Cameron Blevin’s analysis of Martha Ballard’s diary is often cited as an example: Blevin identifies groups of words that seem to be associated, for instance, with “midwifery,” “death,” or “gardening,” and tracks these topics over the course of the diary.

“Trends” haven’t received as much attention as topics, but we need some way to describe the pattern that Google’s ngram viewer has made so visible, where groups of related words rise and fall together across long periods of time. I suspect “trend” is as a good a name for this phenomenon as we’ll get.

2011colors — blue, red, green, yellow, in the English corpus 1750-2000

From 1750 to 1920, the prominence of color vocabulary increases by a factor of three, for instance: and when it does, the names of different colors track each other very closely. I would call this a trend. Moreover, it’s possible to extend the principle that conceptually related words rise and fall together beyond cases like the colors and seasons where we’re dealing with an obvious physical category.

AnimAttenArdour — Google data graphed with my own viewer; if you compare this to Google's viewer, remember that I'm merging capitalized and uncapitalized forms, as well as ardor/ardour.

“Animated,” “attentive,” and “ardour” track each other almost as closely as the names of primary colors (the correlation coefficients are around 0.8), and they characterize conduct in ways that are similar enough to suggest that we’re looking at the waxing and waning not just of a few random words, but of a conceptual category — say, a particular sort of interest in states of heightened receptiveness or expressivity.

I think we could learn a lot by thoughtfully considering “trends” of this sort, but it’s also a kind of evidence that’s not easy to interpret, and that could easily be abused. A lot of other words correlate almost as closely with “attentive,” including “propriety,” “elegance,” “sentiments,” “manners,” “flattering,” and “conduct.” Now, I don’t think that’s exactly a random list (these terms could all be characterized loosely as a discourse of manners), but it does cover more conceptual ground than I initially indicated by focusing on words like “animated” and “ardour.” And how do we know that any of these terms actually belonged to the same “discourse”? Perhaps the books that talked about “conduct” were careful not to talk about “ardour”! Isn’t it possible that we have several distinct discourses here that just happened to be rising and falling at the same time?

In order to answer these questions, I’ve been developing a technique that mines “trends” that are at the same time “topics.” In other words, I look for groups of terms that hold together both in the sense that they rise and fall together (correlation across time), and in the sense that they tend to be common in the same documents (co-occurrence). My way of achieving this right now is a two-stage process: first I mine loosely defined trends from the Google ngrams dataset (long lists of, say, one hundred closely correlated words), and then I send those trends to a smaller, generically diverse collection (including everything from sermons to plays) where I can break the list into clusters of terms that tend to occur in the same kinds of documents.

I do this with the same vector space model and hierarchical clustering technique I’ve been using to map eighteenth-century diction on a larger scale. It turns the list of correlated words into a large, branching tree. When you look at a single branch of that tree you’re looking at what I would call a “trending topic” — a topic that represents, not a stable, more-or-less-familiar conceptual category, but a dynamically-linked set of concepts that became prominent at the same time, and in connection with each other.

MannersTree — one branch of a tree created by finding words that correlate with "manners," and then clustering them based on co-occurrence in 18c books

Here, for instance, is a branch of a larger tree that I produced by clustering words that correlate with “manners” in the eighteenth century. It may not immediately look thematically coherent. We might have expected “manners” to be associated with words like “propriety” or “conduct” (which do in fact correlate with it over time), but when we look at terms that change in correlated ways and occur in the same volumes, we get a list of words that are largely about wealth and rank (“luxury,” “opulence,” “magnificence”), as well as the puzzling “enervated.” To understand a phenomenon like this, you can simply reverse the process that generated it, by using the list as a search query in the eighteenth-century collection it’s based on. What turned up in this case were, pre-eminently, a set of mid-eighteenth-century works debating whether modern commercial opulence, and refinements in the arts, have had an enervating effect on British manners and civic virtue. Typical examples are John Brown’s Estimate of the Manners and Principles of the Times (1757) and John Trusler’s Luxury no Political Evil but Demonstratively Proved to be Necessary to the Preservation and Prosperity of States (1781). I was dimly aware of this debate, but didn’t grasp how central it became to debate about manners, and certainly wasn’t familiar with the works by Brown and Trusler.

I feel like this technique is doing what I want it to do, practically, as a literary historian. It makes the ngram viewer something more than a provocative curiosity. If I see an interesting peak in a particular word, I can can map the broader trend of which it’s a part, and then break that trend up into intersecting discourses, or individual works and authors.

Admittedly, there’s something inelegant about the two-stage process I’m using, where I first generate a list of terms and then use a smaller collection to break the list into clusters. When I discussed the process with Ben Schmidt and Miles Efron, they both, independently, suggested that there ought to be some simpler way of distinguishing “trends” from “topics” in a single collection, perhaps by using Principal Component Analysis. I agree about that, and PCA is an intriguing suggestion. On the other hand, the two-stage process is adapted to the two kinds of collections I actually have available at the moment: on the one hand, the Google dataset, which is very large and very good at mapping trends with precision, but devoid of metadata, on the other hand smaller, richer collections that are good at modeling topics, but not large enough to produce smooth trend lines. I’m going to experiment with Principal Component Analysis and see what it can do for me, but in the meantime — speaking as a literary historian rather than a computational linguist — I’m pretty happy with this rough-and-ready way of identifying trending topics. It’s not an analytical tool: it’s just a souped-up search technology that mines trends and identifies groups of works that could help me understand them. But as a humanist, that’s exactly what I want text mining to provide.

18c 19c linguistics

“… a selection of the language really spoken by men”?

Post author By tedunderwood
Post date March 17, 2011
4 Comments on “… a selection of the language really spoken by men”?

William Wordsworth’s claim to have brought poetry back to “the language of conversation in the middle and lower classes of society” gets repeated to each new generation of students (1). But did early nineteenth-century writing in general become more accessible, or closer to speech? It’s hard to say. We’ve used remarks like Wordsworth’s to anchor literary history, but we haven’t had a good way to assess their representativeness.

Increasingly, though, we’re in a position to test some familiar stories about literary history — to describe how the language of one genre changed relative to others, or even relative to “the language of conversation.” We don’t have eighteenth-century English speakers to interview, but we do have evidence about the kinds of words that tend to be more common in spoken language. For instance, Laly Bar-Ilan and Ruth Berman have shown in the journal Linguistics that contemporary spoken English is distinguished from writing by containing a higher proportion of words from the Old English part of the lexicon (2). This isn’t terribly surprising, since English was for a couple of hundred years (1066-1250) almost exclusively a spoken language, while French and Latin were used for writing. Any word that entered English before this period, and survived, had to be the kind of word that gets used in conversation. Words that entered afterward were often borrowed from French or Latin to flesh out the written language.

If the spoken language was distinguished from writing this way in the thirteenth century, and the same thing holds true today, then one might expect it to hold true in the eighteenth and nineteenth centuries as well. And it does seem to hold true: eighteenth-century drama, written to be spoken on stage, is distinguished from nondramatic poetry and prose by containing a higher proportion of Old English words. This is a broad-brush approach to diction, and not one that I would use to describe individual works. But applied to an appropriately large canvas, it may give us a rough picture of how the “register” of written diction has changed across time, becoming more conversational or more formal.

This graph is based on a version of the Google English corpus that I’ve cleaned up in a number of ways. Common OCR errors involving s, f, and ct have been corrected. The graph shows the aggregate frequency of the 500 most common English words that entered the language before the twelfth century. (I’ve found date-of-entry a more useful metric of a word’s affinity with spoken language than terms like “Latinate” or “Germanic.” After all, “Latinate” words like “school,” “street,” and “wall” don’t feel learned to us, because they’ve been transmitted orally for more than a millennium.) I’ve excluded a list of stopwords that includes determiners, prepositions, pronouns, and conjunctions, as well as the auxiliary verbs “be,” “will,” and “have.”

In relative terms, the change here may not look enormous; the peak in the early eighteenth century (181 words per thousand) is only about 20% higher than the trough in the late eighteenth century (152 words per thousand). But we’re talking about some of the most common words in the language (can, think, do, self, way, need, know). It’s a bit surprising that this part of the lexicon fluctuates at all. You might expect to see a gradual decline in the frequency of these words, as the overall size of the lexicon increases. But that’s not what happens: instead we see a rapid decline in the eighteenth century (as prose becomes less like speech, or at least less like the imagined speech of contemporaneous drama), and then a gradual recovery throughout the nineteenth century.

What does this tell us about literature? Not much, without information about genre. After all, as I mentioned, dramatic writing is a lot closer to speech than, say, poetry is. This curve might just be telling us that very few plays got written in the late eighteenth century.

Fortunately it’s possible to check the Google corpus against a smaller corpus of individual texts categorized by genre. I’ve made an initial pass at the first hundred years of this problem using a corpus of 2,188 eighteenth-century books produced by ECCO-TCP, which I obtained in plain text with help from Laura Mandell and 18thConnect. Two thousand books isn’t a huge corpus, especially not after you divide them up by genre, so these results are only preliminary. But the initial results seem to confirm that the change involved the language of prose itself, and not just changes in the relative prominence of different genres. Both fiction and nonfiction prose show a marked change across the century. If I’m right that the frequency of pre-12c words is a fair proxy for resemblance to spoken language, they became less and less like speech.

“Fiction” is of course a fuzzy category in the eighteenth century. The blurriness of the boundary between a sensationalized biography and a “novel” is a lot of the point of being a novel. In the graph above, I’ve lumped biographies and collections of personal letters in with novels, because I’m less interested in distinguishing something unique about fiction than I am in confirming a broad change in the diction of nondramatic prose.

By contrast, there’s relatively little change in the diction of poetry and drama. The proportion of pre-twelfth-century words is roughly the same at the end of the century as it was at the beginning.

Are these results intuitive, or are they telling us something new? I think the general direction of these curves probably confirms some intuitions. Anyone who studies eighteenth and nineteenth-century English knows that you get a lot of long words around 1800. Sad things become melancholy, needs become a necessity, and so on.

What may not be intuitive is how broad and steady the arc of change appears to be. To the extent that we English professors have any explanation for the elegant elaboration of late-eighteenth-century prose, I think we tend to blame Samuel Johnson. But these graphs suggest that much of the change had already taken place by the time Johnson published his Dictionary. Moreover, our existing stories about the history of style put a lot of emphasis on poetry — for instance, on Wordsworth’s critique of poetic diction. But the biggest changes in the eighteenth century seem to have involved prose rather than poetry. It’ll be interesting to see whether that holds true in the nineteenth century as well.

How do we explain these changes? I’m still trying to figure out. In the next couple of weeks I’ll write a post asking what took up the slack: what kinds of language became common in books where old, common words were relatively underrepresented?

—– references —–
1) William Wordsworth and Samuel T. Coleridge, Lyrical Ballads, with a Few Other Poems (Bristol: 1798), i.
2) Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.