The Stone and the Shell

Exploring the relationship between topics and trends.

Post author By tedunderwood
Post date November 18, 2011
1 Comment on Exploring the relationship between topics and trends.

I’ve been talking about correlation since I started this blog. Actually, that was the reason why I did start it: I think literary scholars can get a huge amount of heuristic leverage out of the fact that thematically and socially-related words tend to rise and fall together. It’s a simple observation, and one that stares you in the face as soon as you start to graph word frequencies on the time axis.¹ But it happens to be useful for literary historians, because it tends to uncover topics that also pose periodizable kinds of puzzles. Sometimes the puzzle takes the form of a topic we intuitively recognize (say, the concept of “color”) that increases or decreases in prominence for reasons that remain to be explained:

At other times, the connection between elements of the topic is not immediately intuitive, but the terms are related closely enough that their correlation suggests a pattern worthy of further exploration. The relationship between terms may be broadly historical:

Or it may involve a pattern of expression that characterizes a periodizable style:

Of course, as the semantic relationship between terms becomes less intuitively obvious, scholars are going to wonder whether they’re looking at a real connection or merely an accidental correlation. “Ardent” and “tranquil” seem like opposites; can they really be related as elements of a single discourse? And what’s the relationship to “bosom,” anyway?

Ultimately, questions like this have to be addressed on a case-by-case basis; the significance of the lead has to be fleshed out both with further analysis, and with close reading.

But scholars who are wondering about the heuristic value of correlation may be reassured to know that this sort of lead does generally tend to pan out. Words that correlate with each other across the time axis do in practice tend to appear in the same kinds of volumes. For instance, if you randomly select pairs of words from the top 10,000 words in the Google English ngrams dataset 1700-1849,² measure their correlation with each other in that dataset across the period 1700-1849, and then measure their tendency to appear in the same volumes in a different collection³ (taking the cosine similarity of term vectors in a term-document matrix), the different measures of association correlate with each other strongly. (Pearson’s r is 0.265, significant at p < 0.0005.) Moreover, the relationship holds (less strongly, but still significantly) even in adjacent centuries: words that appear in the same eighteenth-century volumes still tend to rise and fall together in the nineteenth century.

Why should humanists care about the statistical relationship between two measures of association? It means that correlation-mining is in general going to be a useful way of identifying periodizable discourses. If you find a group of words that correlate with each other strongly, and that seem related at first glance, it's probably going to be worthwhile to follow up the hunch. You’re probably looking at a discourse that is bound together both diachronically (in the sense that the terms rise and fall together) and topically (in the sense that they tend to appear in the same kinds of volumes).

Ultimately, literary historians are going to want to assess correlation within different genres; a dataset like Google's, which mixes all genres in a single pool, is not going to be an ideal tool. However, this is also a domain where size matters, and in that respect, at the moment, the ngrams dataset is very helpful. It becomes even more helpful if you correct some of the errors that vitiate it in the period before 1820. A team of researchers at Illinois and Stanford⁴, supported by the Andrew W. Mellon Foundation, has been doing that over the course of the last year, and we're now able to make an early version of the tool available on the web. Right now, this ngram viewer only covers the period 1700-1899, but we hope it will be useful for researchers in that period, because it has mostly corrected the long-s problem that confufes opt1cal charader readers in the 18c — as well as a host of other, less notorious problems. Moreover, it allows researchers to mine correlations in the top 10,000 words of the lexicon, instead of trying words one by one to see whether an interesting pattern emerges. In the near future, we hope to expand the correlation miner to cover the twentieth century as well.

For further discussion of the statistical relationship between topics and trends, see this paper submitted to DHCS 2011.

UPDATE Nov 22, 2011: At DHCS 2011, Travis Brown pointed out to me that Topics Over Time (Wang and McCallum) might mine very similar patterns in a more elegant, generative way. I hope to find a way to test that method, and may perhaps try to build an implementation for it myself.

References
1) Ryan Heuser and I both noticed this pattern last winter. Ryan and Long Le-Khac presented on a related topic at DH2011: Heuser, Ryan, and Le-Khac, Long. “Abstract Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field,” Digital Humanities 2011, Stanford University.

2) Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (Published online ahead of print: 12/16/2010)

3) The collection of 3134 documents (1700-1849) I used for this calculation was produced by combining ECCO-TCP volumes with nineteenth-century volumes selected and digitized by Jordan Sellers.

4) The SEASR Correlation Analysis and Ngrams Viewer was developed by Loretta Auvil and Boris Capitanu at the Illinois Informatics Institute, modeled on prototypes built by Ted Underwood, University of Illinois, and Ryan Heuser, Stanford.

linguistics math methodology

Identifying diction that characterizes an author or genre: why Dunning’s may not be the best method.

Post author By tedunderwood
Post date November 9, 2011
14 Comments on Identifying diction that characterizes an author or genre: why Dunning’s may not be the best method.

Most of what I’m about to say is directly lifted from articles in corpus linguistics (1, 2), but I don’t think these results have been widely absorbed yet by people working in digital humanities, so I thought it might be worthwhile to share them, while demonstrating their relevance to literary topics.

The basic question is just this: if I want to know what words or phrases characterize an author or genre, how do I find out? As Ben Schmidt has shown in an elegantly visual way, simple mathematical operations won’t work. If you compare ratios (dividing word frequencies in the genre A that interests you by the frequencies in a corpus B used as a point of comparison), you’ll get a list of very rare words. But if you compare the absolute magnitude of the difference between frequencies (subtracting B from A), you’ll get a list of very common words. So the standard algorithm that people use is Dunning’s log likelihood,

— a formula that incorporates both absolute magnitude (O is the observed frequency) and a ratio (O/E is the observed frequency divided by the frequency you would expect). For a more complete account of how this is calculated, see Wordhoard.

But there’s a problem with this measure, as Adam Kilgarriff has pointed out (1, pp. 237-38, 247-48). A word can be common in a corpus because it’s very common in one or two works. For instance, when I characterize early-nineteenth-century poetic diction (1800-1849) by comparing a corpus of 60 volumes of poetry to a corpus of fiction, drama, and nonfiction prose from the same period (3), I get this list:

Much of this looks like “poetic diction” — but “canto” is poetic diction only in a weird sense. It happens to be very common in a few works of poetry that are divided into cantos (works for instance by Lord Byron and Walter Scott). So when everything is added up, yes, it’s more common in poetry — but it doesn’t broadly characterize the corpus. Similar problems occur for a range of other reasons (proper nouns and pronouns can be extremely common in a restricted context).

The solution Kilgarriff offers is to instead use a Mann-Whitney ranks test. This allows us to assess how consistently a given term is more common in one corpus than in another. For instance, suppose I have eight text samples of equal length. Four of them are poetry, and four are prose. I want to know whether “lamb” is significantly more common in the poetry corpus than in prose. A simple form of the Mann-Whitney test would rank these eight samples by the frequency of “lamb” and then add up their respective ranks:

Since most works of poetry “beat” most works of prose in this ranking, the sum of ranks for poetry is higher, in spite of the 31 occurrences of lamb in one work of prose — which is, let us imagine, a novel about sheep-rustling in the Highlands. But a log-likelihood test would have identified this word as more common in prose.

In reality, one never has “equal-sized” documents, but the test is not significantly distorted if one simply replaces absolute frequency with relative frequency (normalized for document size). (If one corpus has on average much smaller documents than the other does, there may admittedly be a slight distortion.) Since the number of documents in each corpus is also going to vary, it’s useful to replace the rank-sum (U) with a statistic ρ (Mann-Whitney rho) that is U, divided by the product of the sizes of the two corpora.

Using this measure of over-representation in a corpus produces a significantly different model of “poetic diction”:

This looks at first glance like a better model. It demotes oddities like “canto,” but also slightly demotes pronouns like “thou” and “his,” which may be very common in some works of poetry but not others. In general, it gives less weight to raw frequency, and more weight to the relative ubiquity of a term in different corpora. Kilgarriff argues that the Mann-Whitney test thereby does a better job of identifying the words that characterize male and female conversation (1, pp. 247-48).

On the other hand, Paul Rayson has argued that by reducing frequency to a rank measure, this approach discards “most of the evidence we have about the distribution of words” (2). For linguists, this poses an interesting, principled dilemma, where two statistically incompatible definitions of “distinctive diction” are pitted against each other. But for a shameless literary hack like myself, it’s no trouble to cut the Gordian knot with an improvised algorithm that combines both measures. For instance, one could multiply rho by the log of Dunning’s log likelihood (represented here as G-squared) …

I don’t yet know how well this algorithm will perform if used for classification or authorship attribution. But it does produce what is for me an entirely convincing portrait of early-nineteenth-century poetic diction:

Of course, once you have an algorithm that convincingly identifies the characteristic diction of a particular genre relative to other publications in the same period, it becomes possible to say how the distinctive diction of a genre is transformed by the passage of time. That’s what I hope to address in my next post.

UPDATE Nov 10, 2011: As I continue to use these tests in different ways (using them e.g. to identify distinctively “fictional” diction, and to compare corpora separated by time) I’m finding the Mann-Whitney ρ measure more and more useful on its own. I think my urge to multiply it by Dunning’s log-likelihood may have been the needless caution of someone who’s using an unfamiliar metric and isn’t sure yet whether it will work unassisted.

References
(1) Adam Kilgarriff, “Comparing Corpora,” International Journal of Corpus Linguistics 6.1 (2001): 97-133.
(2) Paul Rayson, Matrix: A Statistical Method and Software Tool for Linguistic Analysis through Corpus Comparison. Unpublished Ph.D thesis, Lancaster University, 2003, p. 47. Cited in Magali Paquot and Yves Bestgen, “Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction,” Corpora: Pragmatics and Discourse Papers from the 29th International Conference on English Language Research on Computerized Corpora (ICAME 29), Ascona, Switzerland, 14-18 May 2008, p. 254.
(3) The corpora used in this post were selected by Jordan Sellers, mostly from texts available in the Internet Archive, and corrected with a Python script described in this post.

undigitized humanities

On transitive and intransitive uses of the verb “theorize.”

Post author By tedunderwood
Post date October 22, 2011
6 Comments on On transitive and intransitive uses of the verb “theorize.”

I’m a relative newcomer to digital humanities; I’ve been doing this for about a year now. The content of the field has been interesting, but in some ways even more interesting is the way it has transformed my perception of the academy as a social structure. There are clearly going to be debates over the next few years between more and less digitized humanists, and debate is probably a good thing for everyone. But the debate can be much more illuminating if we acknowledge up front that it’s also a tension between two different forms of social organization.

Here’s what happens when that dimension of the issue goes unacknowledged: a tenured or tenure-track faculty member will give a talk or write a blog post about the digital humanities, saying essentially “you’ve got some great tools there, but before they can really matter, their social implications need to be theorized more self-consciously.” Said professor is then surprised when the librarians, or academic professionals, or grad students, who have in many cases designed and built those tools reply with a wry look.

The reason for this, as Miriam Posner recently tweeted, is that “theory has been the province of scholars,” while “the work of DH has been done by staff.” So when you say “those tools need to be theorized,” you are in effect saying “those tools need to be appropriated or regulated by someone like me.” That’s, so to speak, the social implication.

I hasten to add that I’ve got nothing against theories. I wouldn’t mind constructing a few myself. Literary theory, social theory, statistical theory — they’re all fun. But when the word “Theory” is used without adjective or explication, it does in my view deserve a wry look. When you take away all the adjectives, what’s left is essentially a status marker.

So let’s not play that game. Nothing “needs to be theorized” in a vague transitive way; academics who use phrases like that need to realize what they’re saying. DH is an intensely interdisciplinary field that already juggles several different kinds of theory, and actively reflects on the social significance of its endeavors (e.g. in transforming scholarly communication). It is also, among other things, an insurgent challenge to academic hierarchy, organized and led by people who often hold staff positions — which means that the nature of the boundary between practice and theory is precisely one of the questions it seeks to contest.

But as long as everyone understands that “theory” is not a determinate object belonging to a particular team, then I say, the more critique, debate, and intellectual exchange the better. For instance, I quite enjoyed Natalia Cecire’s recent blog post on ways DH could frame its engagement with literary theory more ambitiously. I don’t know whether it’s a good idea to have a “theory THATcamp”; I haven’t been to THATcamp, and don’t know whether its strengths (which seem to lie in collaboration) are compatible with that much yacking. But I think Cecire is absolutely right to insist that DH can and should change the way the humanities are practiced. Because digital approaches make it possible to ask and answer different kinds of questions, there’s going to be a reciprocal interaction between humanistic goals and digital methods, not, as Cecire puts it, a “merely paratactic, additive concatenation.” We’re going to need to theorize about methods and goals at the same time. Together. Intransitively.

[Sun, Oct 23, 2011 — This post is slightly revised from the original version, mostly for clarity.]

18c LSA math methodology topic modeling

LSA is a marvellous tool, but literary historians may want to customize it for their own discipline.

Post author By tedunderwood
Post date October 16, 2011
9 Comments on LSA is a marvellous tool, but literary historians may want to customize it for their own discipline.

Right now Latent Semantic Analysis is the analytical tool I’m finding most useful. By measuring the strength of association between words or groups of words, LSA allows a literary historian to map themes, discourses, and varieties of diction in a given period.

This approach, more than any other I’ve tried, turns up leads that are useful for me as a literary scholar. But when I talk to other people in digital humanities, I rarely hear enthusiasm for it. Why doesn’t LSA get more love? I see three reasons:

1. The word “semantic” is a false lead: it points away from the part of this technique that would actually interest us. It’s true that Latent Semantic Analysis is based on the observation that a word’s distribution across a collection of documents works remarkably well as a first approximation of its meaning. A program running LSA can identify English synonyms on the TOEFL as well as the average student applying to college from a non-English-speaking country. [1]

But for a literary historian, the value of this technique does not depend on its claim to identify synonyms and antonyms. We may actually be more interested in contingent associations (e.g., “sensibility” — “rousseau” in the list on the left) than we are in the core “meaning” of a word.

I’ll return in a moment to this point. It has important implications, because it means that we want LSA to do something slightly different than linguists and information scientists have designed it to do. The “flaws” they have tried to iron out of the technique may not always be flaws for our purposes.

2. People who do topic-modeling may feel that they should use more-recently-developed Bayesian methods, which are supposed to be superior on theoretical grounds. I’m acknowledging this point just to set it aside; I’ve mused out loud about it once already, and I don’t want to do more musing until I have rigorously compared the two methods. I will say that from the perspective of someone just getting started, LSA is easier to implement than Bayesian topic modeling: it runs faster and scales up more easily.

3. The LSA algorithm provided by an off-the-shelf package is not necessarily the best algorithm for a literary historian. At bottom, that’s why I’m writing this post: humanists who want to use LSA are going to need guidance from people in their own discipline. Computer scientists do acknowledge that LSA requires “tuning, which is viewed as a kind of art.” [2] But they also offer advice about “best practices,” and some of those best practices are defined by disciplinary goals that humanists don’t share.

For instance, the power of LSA is often said to come from “reducing the dimensionality of the matrix.” The matrix in question is a term-document matrix — documents are listed along one side of the matrix, and terms along the other, and each cell of the matrix (tf_i,j) records the number of times term i appears in document j, modified by a weighting algorithm described at the end of this post.

That term-document matrix in and of itself can tell you a lot about the associations between words; all you have to do is measure the similarity between the vectors (columns of numbers) associated with each term. But associations of this kind won’t always reveal synonyms. For instance, “gas” and “petrol” might seem unrelated, because they substitute for each other in different sociolects and are rarely found together. To address that problem, you can condense the matrix by factorizing it with a technique called singular value decomposition (SVD). I’m not going to get into the math here, but the key is that condensing the matrix partially fuses related rows and columns — and as a result, the compressed matrix is able to measure transitive kinds of association. The words “gas” and “petrol” may rarely appear together. But they both appear with the same kinds of other words. So when dimensionality reduction “merges” the rows representing similar documents, “gas” and “petrol” will end up being strongly represented in the same merged rows. A compressed matrix is better at identifying synonyms, and for that reason at information retrieval. So there is a lot of consensus among linguists and information scientists that reducing the number of dimensions in the matrix is a good idea.

But literary historians approach this technique with a different set of goals. We care a lot about differences of sociolect and register, and may even be more interested in those sources of “noise” than we are in purely semantic relations. “Towering,” for instance, is semantically related to “high.” But I could look that up in a dictionary; I don’t need a computer program to tell me that! I might be more interested to discover that “towering” belongs to a particular subset of poetic diction in the eighteenth century. And that is precisely the kind of accident of distribution that dimensionality-reduction is designed to filter out. For that reason, I don’t think literary applications of LSA are always going to profit from the dimensionality-reduction step that other disciplines recommend.

For about eight months now, I’ve been using a version of LSA without dimensionality reduction. It mines associations simply by comparing the cosine-similarity of term vectors in a term-document matrix (weighted in a special way to address differences of document size). But I wanted to get a bit more clarity about the stakes of that choice, so recently I’ve been comparing it to a version of LSA that does use SVD to compress the matrix.

Comparing 18c associations for "delicacy" generated by two different algorithms.

Here’s a quick look at the results. (I’m using 2,193 18c volumes, mostly produced by TCP-ECCO; volumes that run longer than 100,000 words get broken into chunks that can range from 50k-100k words.) In many cases, the differences between LSA with and without compression are not very great. In the case of “delicacy,” for instance, both algorithms indicate that “delicate” has the strongest association. “Politeness” and “tenderness” are also very high on both lists. But compare the second row. The algorithm with compression produces “sensibility” — a close synonym. On the left-hand side, we have “woman.” This is not a synonym for “delicacy,” and if a linguist or computer scientist were evaluating these algorithms, it would probably be rejected as a mistake. But from a literary-historical point of view, it’s no mistake: the association between “delicacy” and femininity is possibly the most interesting fact about the word.

without — The 18c associations of "high" and "towering," in an uncompressed term-document matrix.

In short, compressing the matrix with SVD highlights semantic relationships at the cost of slightly blurring other kinds of association. In the case of “delicacy,” the effect is fairly subtle, but in other cases the difference between the two approaches is substantial. For instance, if you measure the similarity of term vectors in a matrix without compression, “high” and “towering” look entirely different. The main thing you discover about “high” is that it’s used for physical descriptions of landscape (“lies,” “hills”), and the main thing you discover about “towering” is that it’s used in poetic contexts (“flowery,” “glittering”).

The 18c. associations of "high" and "towering," as measured in a term-document matrix that has undergone SVD compression.

In a matrix that has undergone dimensionality reduction with SVD, associations have a much more semantic character, although they are still colored by other dimensions of context. Which of these two algorithms is more useful for humanistic purposes? I think the answer is going to depend on the goals being pursued in a given research project — if you’re interested in “topics” that are strictly semantic, you might want to use an algorithm that reduces dimensionality with SVD. If you’re interested in discourses, sociolects, genres, or types of diction, you might use LSA without dimensionality reduction.

My purpose here isn’t to choose between those approaches; it’s just to remind humanists that the algorithms we borrow from other disciplines are often going to need to be customized for our own disciplinary purposes. Information scientists have designed topic-modeling algorithms that produce semantically unified topics, because semantic categorization is important for them. But in literary history, we also care about other dimensions of language, and we don’t have to judge topic-modeling algorithms by strictly semantic criteria. How should we judge them? It will probably take decades for us to answer that question fully, but the short answer is just — by how well, in practice, they help us locate critically and historically interesting patterns.

A couple of technical notes: A fine point of LSA that can matter a great deal is how you weight the individual cells in the term-document matrix. For the normal LSA algorithm that uses dimensionality reduction, the consensus is that “log-entropy weighting” works well. You take the log of each frequency, and multiply the whole term vector by the entropy of the vector. I have found that this also works well for humanistic purposes.

For LSA without dimensionality reduction, I would recommend weighting cells by subtracting the expected frequency from the observed frequency. This formula “evens the playing field” between common and uncommon words — and it does so, vitally, in a way that gives a word’s absence from a long document more weight than its absence from a short one. (Much of LSA’s power actually comes from learning where a given word tends not to appear. [3]) I have tried various ways of applying log-entropy weighting without compressing the matrix, and I do not recommend it. Those two techniques belong together.

For reasons that remain somewhat mysterious (although the phenomenon itself is widely discussed), dimensionality reduction seems to work best when the number of dimensions retained is in the range of 250-350. Intuitively, it would seem possible to strike a sort of compromise between LSA methods that do and don’t compress the matrix by reducing dimensionality less drastically (perhaps only, say, cutting it by half). But in practice I find that doesn’t work very well; I suspect compression has to reach a certain threshold before the noise inherent in the process starts to cancel itself out and give way to a new sort of order.

[1] Thomas K. Landauer, Peter W. Foltz, and Darrell Latham, An Introduction to Latent Semantic Analysis, Discourse Processes 25 (1998): 259-84. Web reprint, p. 22.
[2] Preslav Nakov, Elena Valchanova, and Galia Angelova, “Towards Deeper Understanding of Latent Sematic Analysis Performance,” Recent Advances in Natural Language Processing, ed. Nicolas Nicolov (Samokov, Bulgaria: John Benjamins, 2004), 299.
[3] Landauer, Foltz, and Latham, p. 24.

18c 19c collection-building

The challenges of digital work on early-19c collections.

Post author By tedunderwood
Post date October 7, 2011
7 Comments on The challenges of digital work on early-19c collections.

I’ve been posting mostly about collections built by other people (TCP-ECCO and Google). But I’m also in the process of building a small (thousand-title) 19c collection myself, in collaboration with E. Jordan Sellers. Jordan is selecting titles for the collection; I’m writing the Python scripts that process the texts. This is a modest project intended to support research for a few years, not a model for long-term curatorial practice. But we’ve encountered a few problems specific to the early 19c, and I thought I might share some of our experience and tools in case they’re useful for other early-19c scholars.

Smellie2 — Literary and Characteristical Lives (1800), by William and Alexander Smellie. Note esp. the ligatures in 'first' and 'section.'

I originally wanted to create a larger collection, containing twenty or thirty thousand volumes, on the model of Ben Schmidt’s impressive work with nineteenth-century volumes vacuumed up from the Open Library. But because I needed a collection that bridged the eighteenth and nineteenth centuries, I found I had to proceed more slowly. The eighteenth century itself wasn’t the problem. Before 1800, archaic typography makes most optical character recognition unreliable — but for that very reason, TCP-ECCO has been producing clean, manually-keyed versions of 18c texts, enough at least for a small collection. The later 19c also isn’t a problem, because after 1830 or so, OCR quality is mostly adequate.

OCR2 — OCR version of Smellie, contributed by Columbia University Libraries to the Internet Archive.

But between 1800 and (say) 1830, you fall between two stools. It’s technically the nineteenth century, so people assume that OCR ought to work. But in practice, volumes from this period still have a lot of eighteenth-century typographical quirks, including loopy ligatures, the notorious “long s,” and worn or broken type. So the OCR is often pretty vile. I’m willing to put up with background noise if it’s evenly distributed. But these errors are distributed unevenly across the lexicon and across time, so they could actually distort conclusions if left unaddressed.

I decided to build a Python script to do post-processing correction of OCR. There are a lot of ways to do this; my approach was modeled on a paper written by Thomas A. Lasko and Susan E. Hauser for the National Library of Medicine. Briefly, what they show is that OCR correction becomes much more reliable when the program is given statistical information about the language, and errors, to be expected in a given domain. They’re working with contemporary text, but the principle holds even more strongly when you’re working in a different historical period. A generic spellchecker won’t perform well with texts that contain period spellings (“despatch,” “o’erflow’d”), systematic f/s substitution, and a much higher proportion of Latin and French than we’re used to. If your system corrects every occurrence of “même” to “mime,” you’re going to end up with a surprising number of mimes; if you accept “foul” at face value as a correctly-spelled word, you’re going to have very little “soul” in your collection.

Briefly, I customized my spellchecker for the early 19c in three ways:

A few other tricks are needed to optimize speed, and to make sure the script doesn’t over-correct proper nouns; anyone who’s interested in doing this should drop me a line for a fuller description and a copy of the code.

Corrected2 — Automatically corrected version.

The results aren’t perfect, but they’re good enough to be usable (I am also recording the number of corrections and uncorrectable tokens so that I can assess margins of error later on).

I haven’t packaged this code yet for off-the-shelf use; it’s still got a few trailing wires. But if you want to cannibalize/adapt it, I’d be happy to give you a copy. Perhaps more importantly, I’d like to share a couple of sets of rules that might be helpful for anyone who’s attempting to normalize an 18/19c collection. Both of these rulesets are tab-delimited utf-8 .txt files. First, my list of 4600 rules for correcting 18/19c spellings, including syncopated past-tense forms like “bury’d” and “drop’d.” (Note that syncope cannot always be fixed simply by adding back an “e.” Rules for normalizing poetic syncope — “flow’ry,” “ta’en” — are clustered at the end of the file, so you can delete them if desired.) This ruleset has been transformed by a long series of joins and filtering operations, and edited manually, but I should acknowledge that part of the original list was borrowed from the source files that accompany WordHoard, developed at Northwestern University. I should also warn potential users that these rules are designed to normalize spelling to modern British practice.

The other thing it might be useful to share is a list of 2grams extracted from the Google English corpus, that I use for contextual spellchecking. This includes only 2grams where one of the two elements is a token like “fix” or “flip” that could be read either as a valid word or as an OCR error caused by the long s. Since the long s is also a problem in the Google dataset itself up to 1820, this list was based on frequencies from 1825-50. That’s not perfect for correcting texts in the 1800-1820 period, but I find that in practice it’s adequate. There are two columns here: the 2gram itself, and the frequency.

methodology teaching

A course description.

I thought I would share the description of a graduate course I’ll be teaching in Spring 2012. It’s targeted specifically at students in English literature. So instead of teaching an “introduction to digital humanities” as a whole, I’ve decided to focus on the parts of this research program that seem to integrate most easily into literary study. I want to help students take risks — but I also want to focus, candidly, on risks that seem likely to produce useful credentials within the time frame of graduate study.

I think the perception among professors of literature may be that TEI-based editing is the digital tool that integrates most easily into what we do. But where grad students are concerned, I think new modes of collection-mapping are actually more widely useful, because they generate leads that can energize projects not otherwise centrally “digital.” This approach is technically a bit more demanding than TEI would be, but if students are handed a few simple modules (LSA-based topic modeling, Dunning’s log likelihood, collocation analysis, entity extraction, time series graphing) I think it’s fairly easy to reveal discourses, trends, and perhaps genres that no one has discussed. I’ll be sharing my own tools built in R, and an 18-19c collection I have developed in collaboration with E. Jordan Sellers. But I’ll also ask students to learn some basic elements of R themselves, so that they can adapt or connect modules and generate their own visualizations. As we get into problems that exceed the power of the average Mac, I’ll introduce students to the modular resources of SEASR. Wish us luck — it’s an experiment!

ENGL 581. Digital Tools and Critical Theory. Spring 2012.

Critical practice is already shaped by technology. Contemporary historicism emerged around the same time as full-text search, for instance, and would be hard to envision without it. Our goal in this course will be to make that relationship more reciprocal by using critical theory to shape technology in turn. For example, the prevailing system of “keyword search” requires scholars to begin by guessing how another era categorized the world. But much critical theory suggests that we cannot predict those categories in advance, and there are ways of mapping an archive that don’t require us to.

I’ve found that it does make a difference: when critics build their own tools, they can uncover trends and discourses that standard search technology does not reveal. The course will not assume any technical background, although it does assume willingness to learn a few basic elements of programming and statistics. Many of the tools/collections we need are already available on the web; others I can give you, or show you how to cobble together. We will often take time out from building things to read theory — like Moretti’s Maps, Graphs, Trees (2005), corpus linguistics, and influential critiques of or definitions of the digital humanities. But we will not mostly be writing about digital humanities. Instead I’ll recommend writing an ordinary critical essay about literary/cultural history, subtly informed by new tools or new models of discourse. (Underline “subtly.”) Projects on any period are possible, although the resources I can provide are admittedly richest between 1700 and 1900.

*****
By the way, it would be churlish of me not to acknowledge that I’ve learned much of what I know about this topic from grad students, and especially (where methodology is concerned) from Benjamin Schmidt, whose blog posts are an education in themselves and will certainly be on the syllabus. “Graduate education” in this field is a very circular process.

18c 19c math ngrams

Words that appear in the same 18c volumes also track each other over time, through the 19c.

Post author By tedunderwood
Post date September 19, 2011
No Comments on Words that appear in the same 18c volumes also track each other over time, through the 19c.

I wrote a long post last Friday arguing that topic-modeling an 18c collection is a reliable way of discovering eighteenth- and nineteenth-century trends, even in a different collection.

But when I woke up on Saturday I realized that this result probably didn’t depend on anything special about “topic modeling.” After all, the topic-modeling process I was using was merely a form of clustering. And all the clustering process does is locate the hypothetical centers of more-or-less-evenly-sized clouds of words in vector space. It seemed likely that the specific locations of these centers didn’t matter. The key was the underlying measure of association between words — “cosine similarity in vector space,” which is a fancy way of saying “tendency to be common in the same 18c volumes.” Any group of words that were common (and uncommon) in the same 18c volumes would probably tend to track each other over time, even into the next century.

GratifyCurves540 — Six words that tend to occur in the same 18c volumes as 'gratify' (in TCP-ECCO), plotted over time in a different collection (a corrected version of Google ngrams).

To test this I wrote a script that chose 200 words at random from the top 5000 in a collection of 2,193 18c volumes (drawn from TCP-ECCO with help from Laura Mandell), and then created a cluster around each word by choosing the 25 words most likely to appear in the same volumes (cosine similarity). Would pairs of words drawn from these randomly distributed clusters also show a tendency to correlate with each other over time?

They absolutely do. The Fisher weighted mean pairwise r for all possible pairs drawn from the same cluster is .267 in the 18c and .284 in the 19c (the 19c results are probably better because Google’s dataset is better in the 19c even after my efforts to clean the 18c up*). At n = 100 (measured over a century), both correlations have rock-solid statistical significance, p < .01. And in case you're wondering … yes, I wrote another script to test randomly selected words using the same statistical procedure, and the mean pairwise r for randomly selected pairs (factoring out, as usual, partial correlation with the larger 5000-word group they’re selected from) is .0008. So I feel confident that the error level here is low.**

What does this mean, concretely? It means that the universe of word-frequency data is not chaotic. Words that appear in the same discursive contexts tend, strongly, to track each other over time, and (although I haven’t tested this rigorously yet), it’s not hard to see that the converse proposition is also going to hold true: words that track each other over time are going to tend to have contextual associations as well.

To put it even more bluntly: pick a word, any word! We are likely to be able to define not just a topic associated with that word, but a discourse — a group of words that are contextually related and that also tend to wax and wane together over time. I don’t imagine that in doing so we prove anything of humanistic significance, but I do think it means that we can raise several thousand significant questions. To start with: what was the deal with eagerness, gratification, and disappointment in the second half of the eighteenth century?

* A better version of Google’s 18c dataset may be forthcoming from the NCSA.

** For people who care about the statistical reliability of data-mining, here’s the real kicker: if you run a Benjamini-Hochberg procedure on these 200 randomly-generated clusters, 134 of them have significance at p < .05 in the 19c even after controlling for the false discovery rate. To put that more intelligibly, these are guaranteed not to be xkcd’s green jelly beans. The coherence of these clusters is even greater than the ones produced by topic-modeling, but that’s probably because they are on average slightly smaller (25 words); I have yet to test the relative importance of different generation procedures while holding cluster size rigorously constant.

teaching undigitized humanities

It’s okay not to solve “the crisis of the humanities.”

Post author By tedunderwood
Post date September 18, 2011
4 Comments on It’s okay not to solve “the crisis of the humanities.”

I read Cathy Davidson’s latest piece in Academe with pleasure and admiration. She’s right that humanists need to think about the social function of our work, and right that this will require self-criticism. Moreover, Davidson’s work with HASTAC seems to me a model of the sort of innovation we need now.

However, Davidson says such kind things about the digital humanities that someone needs to pour in a few grains of salt. And since I’m a digital humanist, it might as well be me.

To reimagine a global humanism with relevance to the contemporary world means understanding, using, and contributing to new computational tools and methods. … Even a few examples show how being open to digital possibilities changes paradigms and brings new ways of reimagining the humanities into the world.

Reading this, I find myself blushing and stammering. And what I’m stammering is: “slow down a sec, because I’m not sure how central any of this is really going to be to our pedagogical mission.”

I’m going to teach a graduate course on digital humanities next semester, because I’m confident that information technology will change (actually, already has changed) the research end of our discipline. But I’m not yet sure about the implications at the undergraduate level. Maybe ten years from now I’ll be teaching text mining to undergrads … but then again, maybe the things undergraduates need most from an English course will still be historical perspective, close reading, a willingness to revise, and a habit of considering objections to their own thesis.

I’m sure that text mining belongs in undergraduate education somewhere. It raises fascinating social and linguistic puzzles. But I’m not sure whether we’ll be able to fit all the puzzles raised by technological change into the undergrad English major. It’s possible that English departments will want to stay focused on an older mission, leaving these new challenges to be scooped up by Linguistics or Computer Science. If that happens, it’s okay with me. It’s not particularly crucial that all the projects I care about be combined in a single department.

I’m dwelling on this because I feel humanists spend way too much time these days arguing about “what we need to do in order to keep the discipline from shrinking.” Sometimes the answer offered is a) return to our core competence, and sometimes the answer is b) boldly take on some new mission. But really I want to answer c) it is not our job to keep the discipline from shrinking, and we shouldn’t do anything purely for that reason. Our job is to make sure that we keep passing on the critical skills that the humanities develop best, at the same time as we explore new intellectual challenges.

Maybe those new challenges require us to expand. Or maybe it turns out that new challenges are relevant mostly at the graduate level, whereas at the undergraduate level we already have our hands full teaching students social history, close reading, and revision. And maybe that means that departments of English do end up shrinking relative to Communications or CompSci. If so, I hope it doesn’t happen rapidly, because I care about the fortunes of particular graduate students. But in the long term, it would not be a tragedy. Ideas matter. Departmental boundaries don’t. Intellectual history is not a contest to see who can retain the most faculty.

UPDATE Dec. 30 2011: I have to admit that my mind is in the process of being changed about this. After participating in a NITLE-sponsored seminar about teaching digital humanities at the undergraduate level, I’m much less hesitant than I was in September. Ryan Cordell, Brian Croxall, and Jeff McClurken presented really impressive digital-humanities courses that were also deeply grounded in the context of a specific discipline. Recording available at the link above.

18c 19c math methodology topic modeling trend mining

Topics tend to be trends. Really: p < .05!

Post author By tedunderwood
Post date September 16, 2011
No Comments on Topics tend to be trends. Really: p < .05!

While I’m fascinated by cases where the frequencies of two, or ten, or twenty words closely parallel each other, my conscience has also been haunted by a problem with trend-mining — which is that it always works. There are so many words in the English language that you’re guaranteed to find groups of them that correlate, just as you’re guaranteed to find constellations in the night sky. Statisticians call this the problem of “multiple comparisons”; it rests on a fallacy that’s nicely elucidated in this classic xkcd comic about jelly beans.

Simply put: it feels great to find two conceptually related words that correlate over time. But we don’t know whether this is a significant find, unless we also know how many potentially related words don’t correlate.

One way to address this problem is to separate the process of forming hypotheses from the process of testing them. For instance, we could use topic modeling to divide the lexicon up into groups of terms that occur in the same contexts, and then predict that those terms will also correlate with each other over time. In making that prediction, we turn an undefined universe of possible comparisons into a finite set.

Once you create a set of topics, plotting their frequencies is simple enough. But plotting the aggregate frequency of a group of words isn’t the same thing as “discovering a trend,” unless the individual words in the group actually correlate with each other over time. And it’s not self-evident that they will.

Silence2 — The top 15 words in topic #91, "Silence/Listened," and their cosine similarity to the centroid.

So I decided to test the hypothesis that they would. I used semi-fuzzy clustering to divide one 18c collection (TCP-ECCO) into 200 groups of words that tend to appear in the same volumes, and then tested the coherence of those topics over time in a different 18c collection (a much-cleaned-up version of the Google ngrams dataset I produced in collaboration with Loretta Auvil and Boris Capitanu at the NCSA). Testing hypotheses in a different dataset than the one that generated them is a way of ensuring that we aren’t simply rediscovering the same statistical accidents a second time.

To make a long story short, it turns out that topics have a statistically significant tendency to be trends (at least when you’re working with a century-sized domain). Pairs of words selected from the same topic correlated significantly with each other even after factoring out other sources of correlation*; the Fisher weighted mean r for all possible pairs was 0.223, which measured over a century (n = 100) is significant at p < .05.

In practice, the coherence of different topics varied widely. And of course, any time you test a bunch of hypotheses in a row you're going to get some false positives. So the better way to assess significance is to control for the "false discovery rate." When I did that (using the Benjamini-Hochberg method) I found that 77 out of the 200 topics cohered significantly as trends.

There are a lot of technical details, but I'll defer them to a footnote at the end of this post. What I want to emphasize first is the practical significance of the result for two different kinds of researchers. If you're interested in mining diachronic trends, then it may be useful to know that topic-modeling is a reliable way of discovering trends that have real statistical significance and aren’t just xkcd’s “green jelly beans.”

The top 15 terms in topic #89, "Enemy/Attacked," and their cosine similarity to the centroid.

Conversely, if you're interested in topic modeling, it may be useful to know that the topics you generate will often be bound together by correlation over time as well. (In fact, as I’ll suggest in a moment, topics are likely to cohere as trends beyond the temporal boundaries of your collection!)

Finally, I think this result may help explain a phenomenon that Ryan Heuser, Long Le-Khac, and I have all independently noticed: which is that groups of words that correlate over time in a given collection also tend to be semantically related. I've shown above that topic modeling tends to produce diachronically coherent trends. I suspect that the converse proposition is also true: clusters of words linked by correlation over time will turn out to have a statistically significant tendency to appear in the same contexts.

Why are topics and trends so closely related? Well, of course, when you’re topic-modeling a century-long collection, co-occurrence has a diachronic dimension to start with. So the boundaries between topics may already be shaped by change over time. It would be interesting to factor time out of the topic-modeling process, in order to see whether rigorously synchronic topics would still generate diachronic trends.

I haven’t tested that yet, but I have tried another kind of test, to rule out the possibility that we’re simply rediscovering the same trends that generated the topics in the first place. Since the Google dataset is very large, you can also test whether 18c topics continue to cohere as trends in the nineteenth century. As it turns out, they do — and in fact, they cohere slightly more strongly! (In the 19c, 88 out of 200 18c topics cohered significantly as trends.) The improvement is probably a clue that Google’s dataset gets better in the nineteenth century (which god knows, it does) — but even if that’s true, the 19c result would be significant enough on its own to show that topic modeling has considerable predictive power.

Practically, it’s also important to remember that “trends” can play out on a whole range of different temporal scales.

For instance, here’s the trend curve for topic #91, “Silence / Listened,” which is linked to the literature of suspense, and increases rather gradually and steadily from 1700 to the middle of the nineteenth century.

By contrast, here’s the trend curve for topic #89, “Enemy/Attacked,” which is largely used in describing warfare. It doesn’t change frequency markedly from beginning to end; instead it bounces around from decade to decade with a lot of wild outliers. But it is in practice a very tightly-knit trend: a pair of words selected from this topic will have on average 31% of their variance in common. The peaks and outliers are not random noise: they’re echoes of specific armed conflicts.

* Technical details: Instead of using Latent Dirichlet Allocation for topic modeling, I used semi-fuzzy c-means clustering on term vectors, where term vectors are defined in the way I describe in this technical note. I know LDA is the standard technique, and it seems possible that it would perform even better than my clustering algorithm does. But in a sufficiently large collection of documents, I find that a clustering algorithm produces, in practice, very coherent topics, and it has some other advantages that appeal to me. The “semi-fuzzy” character of the algorithm allows terms to belong to more than one cluster, and I use cosine similarity to the centroid to define each term’s “degree of membership” in a topic.

I only topic-modeled the top 5000 words in the TCP-ECCO collection. So in measuring pairwise correlations of terms drawn from the same topic, I had to calculate it as a partial correlation, controlling for the fact that terms drawn from the top 5k of the lexicon are all going to have, on average, a slight correlation with each other simply by virtue of being drawn from that larger group.

18c 19c fiction methodology

For most literary scholars, text mining is going to be an exploratory tool.

Post author By tedunderwood
Post date August 15, 2011
4 Comments on For most literary scholars, text mining is going to be an exploratory tool.

Having just returned from a conference of Romanticists, I’m in a mood to reflect a bit about the relationship between text mining and the broader discipline of literary studies. This entry will be longer than my usual blog post, because I think I’ve got an argument to make that demands a substantial literary example. But you can skip over the example to extract the polemical thesis if you like!

At the conference, I argued that literary critics already practice a crude form of text mining, because we lean heavily on keyword search when we’re tracing the history of a topic or discourse. I suggested that information science can now offer us a wider range of tools for mapping archives — tools that are subtler, more consonant with our historicism, and maybe even more literary than keyword search is.

At the same time, I understand the skepticism that many literary critics feel. Proving a literary thesis with statistical analysis is often like cracking a nut with a jackhammer. You can do it: but the results are not necessarily better than you would get by hand.

One obvious solution would be to use text mining in an exploratory way, to map archives and reveal patterns that a critic could then interpret using nuanced close reading. I’m finding that approach valuable in my own critical practice, and I’d like to share an example of how it works. But I also want to reflect about the social forces that stand in the way of this obvious compromise between digital and mainstream humanists — leading both sides to assume that quantitative analysis ought to contribute instead by proving literary theses with increased certainty.

Annote — Part of a topic tree based on a generically diverse collection of 2200 18c texts.

I’ll start with an example. If you don’t believe text mining can lead to literary insights, bear with me: this post starts with some goofy-looking graphs, but develops into an actual hypothesis about the Romantic novel based on normal kinds of literary evidence. But if you’re willing to take my word that text-mining can produce literary leads, or simply aren’t interested in Romantic-era fiction, feel free to skip to the end of this (admittedly long!) post for the generalizations about method.

Several months ago, when I used hierarchical clustering to map eighteenth-century diction on this blog, I pointed to a small section of the resulting tree that intriguingly mixed language about feeling with language about time. It turned out that the words in this section of the tree were represented strongly in late-eighteenth-century novels (novels, for instance, by Frances Burney, Sophia Lee, and Ann Radcliffe). Other sections of the tree, associated with poetry or drama, had a more vivid kind of emotive language, and I wondered why novels would combine an emphasis on feeling or exclamation (“felt,” “cried”) with the abstract topic of duration (“moment,” “longer”). It seemed an oddly phenomenological way to think about emotion.

But I also realized that hierarchical clustering is a fairly crude way of mapping conceptual space in an archive. The preferred approach in digital humanities right now is topic modeling, which does elegantly handle problems like polysemy. However, I’m not convinced that existing methods of topic modeling (LDA and so on) are flexible enough to use for exploration. One of their chief advantages is that they don’t require the human user to make judgment calls: they automatically draw boundaries around discrete “topics.” But for exploratory purposes boundaries are not an advantage! In exploring an archive, the goal is not to eliminate ambiguity so that judgment calls are unnecessary: the goal is to reveal intriguing ambiguities, so that the human user can make judgments about them.

If this is our goal, it’s probably better to map diction as an associative web. Fortunately, it was easy to get from the tree to a web, because the original tree had been based on an algorithm that measured the strength of association between any two words in the collection. Using the same algorithm, I created a list of twenty-five words most strongly associated with the branch that had interested me (“instantly,” “cried,” “felt,” moment,” “longer,”) and then used the strengths of association between those words to model the whole list as a force-directed graph. In this graph, words are connected by “springs” that pull them closer together; the darker the line, the stronger the association between the two words, and the more tightly they will be bound together in the graph. (The sizes of words are loosely proportional to their frequency in the collection, but only very loosely.)

A graph like this is not meant to be definitive: it’s a brainstorming tool that helps me explore associations in a particular collection (here, a generically diverse collection of eighteenth-century writing). On the left side, we see a triangle of feminine pronouns (which are strongly represented in the same novels where “felt,” “moment,” and so on are strongly represented) as well as language that defines domestic space (“quitting,” “room”). On the right side of the graph, we see a range of different kinds of emotion. And yet, looking at the graph as a whole, there is a clear emphasis on an intersection of feeling and time — whether the time at issue is prospective (“eagerly,” “hastily,” “waiting”) or retrospective (“recollected,” “regret”).

In particular, there are a lot of words here that emphasize temporal immediacy, either by naming a small division of time (“moment,” “instantly”), or by defining a kind of immediate emotional response (“surprise,” “shocked,” “involuntarily”). I have highlighted some of these words in red; the decision about which words to include in the group was entirely a human judgment call — which means that it is open to the same kind of debate as any other critical judgment.

But the group of words I have highlighted in red — let’s call it a discourse of temporal immediacy — does turn out to have an interesting historical profile. We already know that this discourse was common in late-eighteenth-century novels. But we can learn more about its provenance by restricting the generic scope of the collection (to fiction) and expanding its temporal scope to include the nineteenth as well as eighteenth centuries. Here I’ve graphed the aggregate frequency of this group of words in a collection of 538 works of eighteenth- and nineteenth-century fiction, plotted both as individual works and as a moving average. [The moving average won’t necessarily draw a line through the center of the “cloud,” because these works vary greatly in size. For instance, the collection includes about thirty of Hannah More’s “Cheap Repository Tracts,” which are individually quite short, and don’t affect the moving average more than a couple of average-sized novels would, although they create an impressive stack of little circles in the 1790s.]

The shape of the curve here suggests that we’re looking at a discourse that increased steadily in prominence through the eighteenth century and peaked (in fiction) around the year 1800, before sinking back to a level that was still roughly twice its early-eighteenth-century frequency.

Why might this have happened? It’s always a good idea to start by testing the most boring hypothesis — so a first guess might be that words like “moment” and “instantly” were merely displacing some set of close synonyms. But in fact most of the imaginable synonyms for this set of words correlate very closely with them. (This is true, for instance, of words like “sudden,” “abruptly,” and “alarm.”)

Another way to understand what’s going on would be to look at the works where this discourse was most prominent. We might start by focusing on the peak between 1780 and 1820. In this period, the works of fiction where language of temporal immediacy is most prominent include

Zofloya

Hermione, or the Orphan Sisters

The Monk

A Sicilian Romance

The Castles of Athlin and Dunbayne

The Romance of the Forest

Cecilia

The Wanderer

Adeline Mowbray

The Recess

There is a strong emphasis here on the Gothic, but perhaps also, more generally, on women writers. The works of fiction where the same discourse is least prominent would include

Coelebs in Search of a Wife

Hermsprong

Life; or the Adventures of William Ramble

Castle Rackrent

The Children’s Friend

Vaurien; or, Sketches of the Times

Many of these works are deliberately old-fashioned in their approach to narrative form: they are moral parables, or stories for children, or first-person retrospective narratives (like Rackrent), or are told by Fieldingesque narrators who feel free to comment and summarize extensively (as in the works by Disraeli and Trusler).

After looking closely at the way the language of temporal immediacy is used in Frances Burney, Cecilia (1782), and Sophia Lee, The Recess (1785), it seems to me that it had both a formal and an affective purpose.

Formally, it foregrounded a newly sharp kind of temporal framing. If we believe Ian Watt, part of the point of the novel form is to emulate the immediacy of first-hand experience — a purpose that can be slightly at odds with the retrospective character of narrative. Eighteenth-century novelists fought the distancing effect of retrospection in a lot of ways: epistolary narrative, discovered journals and so on are ways of bringing the narrative voice as close as possible to the moment of experience. But those tricks have limits; at some point, if your heroine keeps running off to write breathless letters between every incident, Henry Fielding is going to parody you.

By the late eighteenth century it seems to me novelists were starting to work out ways of combining temporal immediacy with ordinary retrospective narration. Maybe you can’t literally have your narrator describe events as they’re taking place, but you can describe events in a way that highlights their temporal immediacy. This is one of the things that makes Frances Burney read more like a nineteenth-century novelist than like Defoe; she creates a tight temporal frame for each event, and keeps reminding her readers about the tightness of the frame. So, a new paragraph will begin “A few moments after he was gone …” or “At that moment Sir Robert himself burst into the Room …” or “Cecilia protested she would go instantly to Mr Briggs,” to choose a few examples from a single chapter of Cecilia (my italics, 363-71). We might describe this vaguely as a way of heightening suspense — but there are of course many different ways to produce suspense in fiction. Narratology comes closer to the question at issue when it talks about “pacing,” but unless someone has already coined a better term, I think I would prefer to describe this late-18c innovation as a kind of “temporal framing,” because the point is not just that Burney uses “scene” rather than “summary” to make discourse time approximate story time — but that she explicitly divides each “scene” into a succession of discrete moments.

There is a lot more that could be said about this aspect of narrative form. For one thing, in the Romantic era it seems related to a particular way of thinking about emotion — a strategy that heightens emotional intensity by describing experience as if it were divided into a series of instananeous impressions. E.g, “In the cruelest anxiety and trepidation, Cecilia then counted every moment till Delvile came …” (Cecilia, 613). Characters in Gothic fiction are “every moment expecting” some start, shock, or astonishment. “The impression of the moment” is a favorite phrase for both Burney and Sophia Lee. On one page of The Recess, a character “resign[s] himself to the impression of the moment,” although he is surrounded by a “scene, which every following moment threatened to make fatal” (188, my italics).

In short, fairly simple tools for mapping associations between words can turn up clues that point to significant formal, as well as thematic, patterns. Maybe I’m wrong about the historical significance of those patterns, but I’m pretty sure they’re worth arguing about in any case, and I would never have stumbled on them without text mining.

On the other hand, when I develop these clues into a published article, the final argument is likely to be based largely on narratology and on close readings of individual texts, supplemented perhaps by a few simple graphs of the kind I’ve provided above. I suppose I could master cutting-edge natural language processing, in order to build a fabulous tool that would actually measure narrative pace, and the division of scenes into incidents. That would be fun, because I love coding, and it would be impressive, since it would prove that digital techniques can produce literary evidence. But the thing is, I already have an open-source application that can measure those aspects of narrative form, and it runs on inexpensive hardware that requires only water, glucose, and caffeine.

The methodological point I want to make here is that relatively simple forms of text mining, based on word counts, may turn out to be the techniques that are in practice most useful for literary critics. Moreover, if I can speak frankly: what makes this fact hard for us to acknowledge is not technophilia per se, but the nature of the social division between digital humanists and mainstream humanists. Literary critics who want to dismiss text mining are fond of saying “when you get right down to it, it’s just counting words.” (At moments like this we seem to forget everything 20c literary theorists ever learned from linguistics, and go back to treating language as a medium that ideally, ought to be immaterial and transparent. Surely a crudely verbal approach — founded on lumpy, ambiguous words — can never tell us anything about the conceptual subtleties of form and theme!) Stung by that critique, digital humanists often feel we have to prove that our tools can directly characterize familiar literary categories, by doing complex analyses of syntax, form, and sentiment.

I don’t want to rule out those approaches; I’m not interested in playing the game “Computers can never do X.” They probably can do X. But we’re already carrying around blobs of wetware that are pretty good at understanding syntax and sentiment. Wetware is, on the other hand, terrible at counting several hundred thousand words in order to detect statistical clues. And clues matter. So I really want to urge humanists of all stripes to stop imagining that text mining has to prove its worth by proving literary theses.

That should not be our goal. Full-text search engines don’t perform literary analysis at all. Nor do they prove anything. But literary scholars find them indispensable: in fact, I would argue that search engines are at least partly responsible for the historicist turn in recent decades. If we take the same technology used in those engines (a term-document matrix plus vector space math), and just turn the matrix on its side so that it measures the strength of association between terms rather than documents, we will have a new tool that is equally valuable for literary historians. It won’t prove any thesis by itself, but it can uncover a whole new range of literary questions — and that, it seems to me, ought to be the goal of text mining.

References
Frances Burney, Cecilia; or, Memoirs of an Heiress, ed. Peter Sabor and Margaret Ann Doody (Oxford: OUP, 1999).
Sophia Lee, The Recess; or, a Tale of Other Times, ed. April Alliston (Lexington: UP of Kentucky, 2000).

[Postscript: This post, originally titled “How to make text mining serve literary history,” is a version of a talk I gave at NASSR 2011 in Park City, Utah, sharpened by the discussion that took place afterward. I’d like to thank the organizers of the conference (Andrew Franta and Nicholas Mason) as well as my co-panelists (Mark Algee-Hewitt and Mark Schoenfield) and everyone in the audience. The original slides are here; sometimes PowerPoint can be clearer than prose.

I don’t mean to deny, by the way, that the simple tools I’m using could be refined in many ways — e.g., they could include collocations. What I’m saying is I that don’t think we need to wait for technical refinements. Our text-mining tools are already sophisticated enough to produce valuable leads, and even after we make them more sophisticated, it will remain true that at some point in the critical process we have to get off the bicycle and walk.]