The Stone and the Shell

Digital humanities and the spy business.

Post author By tedunderwood
Post date June 30, 2011
3 Comments on Digital humanities and the spy business.

lego-spy — Flickr / dunechaser (Creative Commons)

I’m surprised more digital humanists haven’t blogged the news that the US Intelligence Advanced Projects Activity wants to fund techniques for mining and categorizing metaphors.

The stories I’ve read so far have largely missed the point of the program. They focus instead on the amusing notion that the government “fancies a huge metaphor repository.” And it’s true that the program description reads a bit like a section of English 101 taught by the men from Dragnet. “The Metaphor Program will exploit the fact that metaphors are pervasive in everyday talk and reveal the underlying beliefs and worldviews of members of a culture.” What is “culture,” you ask? Simply refer to section 1.A.3., “Program Definitions”: “Culture is a set of values, attitudes, knowledge and patterned behaviors shared by a group.”

This seems accurate enough, although the combination of precision and generality does feel a little freaky. “Affect is important because it influences behavior; metaphors have been associated with affect.”

The program announcement is similarly precise about the difference between metaphor and metonymy. (They’re not wild about metonymy.)

(3) Figurative Language: The only types of figurative language that are included in the program are metaphors and metonymy.
• Metonymy may be proposed in addition to but not instead of metaphor analysis. Those interested in metonymy must explain why metonymy is required, what metonymy adds to the analysis and how it complements the proposed work on metaphors.

All this is fun, but the program also has a purpose that hasn’t been highlighted by most of the reporting I’ve seen. The second phase of the program will use statistical analysis of metaphors to “characterize differing cultural perspectives associated with case studies of the types of interest to the Intelligence Community.” One can only speculate about those types, but I imagine that we’re talking about specific political movements and religious groups. The goal is ostensibly to understand their “cultural perspectives,” but it seems quite possible that an unspoken, longer-term goal might involve profiling and automatically identifying members of demographic, vocational, or political groups. (IARPA has inherited some personnel and structures once associated with John Poindexter’s Total Information Awareness program.) The initial phase of the metaphor-mining is going to focus on four languages: “American English, Iranian Farsi, Russian Russian and Mexican Spanish.”

Naturally, my feelings are complex. Automatically extracting metaphors from text would be a neat trick, especially if you also distinguished metaphor from metonymy. (You would have to know, for instance, that “Oval Office” is not a metaphor for the executive branch of the US government.) [UPDATE: Arno Bosse points out that Brad Pasanek has in fact been working on techniques for automatic metaphor extraction, and has developed a very extensive archive. Needless to say, I don’t mean to associate Brad with the IARPA project.]

Going from a list of metaphors to useful observations about a “cultural perspective” would be an even neater trick, and I doubt that it can be automated. My doubts on that score are the main source of my suspicion that the actual deliverable of the grant will turn out to be profiling. That may not be the intended goal. But I suspect it will be the deliverable because I suspect that it’s the part of the project researchers will get to work reliably. It probably is possible to identify members of specific groups through statistical analysis of the metaphors they use.

On the other hand, I don’t find this especially terrifying, because it has a Rube Goldberg indirection to it. If IARPA wants to automatically profile people based on digital analysis of their prose, they can do that in simpler ways. The success of stylometry indicates that you don’t need to understand the textual features that distinguish individuals (or groups) in order to make fairly reliable predictions about authorship. It may well turn out that people in a particular political movement overuse certain prepositions, for reasons that remain opaque, although the features are reliably predictive. I am confident, of course, that intelligence agencies would never apply a technique like this domestically.

Postscript: I should credit Anna Kornbluh for bringing this program to my attention.

Uncategorized

Why humanists need to understand text mining.

Post author By tedunderwood
Post date June 29, 2011
8 Comments on Why humanists need to understand text mining.

Humanists are already doing text mining; we’re just doing it in a theoretically naive way. Every time we search a database, we use complex statistical tools to sort important documents from unimportant ones. We don’t spend a lot of time talking about this part of our methodology, because search engines hide the underlying math, making the sorting process seem transparent.

But search is not a transparent technology: search engines make a wide range of different assumptions about similarity, relevance, and importance. If (as I’ve argued elsewhere) search engines’ claim to identify obscure but relevant sources has powerfully shaped contemporary historicism, then our critical practice has come to depend on algorithms that other people write for us, and that we don’t even realize we’re using. Humanists quite properly feel that humanistic research ought to be shaped by our own critical theories, not by the whims of Google. But that can only happen if we understand text mining well enough to build — or at least select — tools more appropriate for our discipline.

altavista460 — The AltaVista search page, circa 1996. This was the moment to freak out about text mining.

This isn’t an abstract problem; existing search technology sits uneasily with our critical theory in several concrete ways. For instance, humanists sometimes criticize text mining by noting that words and concepts don’t line up with each other in a one-to-one fashion. This is quite true: but it’s a critique of humanists’ existing search practices, not of embryonic efforts to improve them. Ordinary forms of keyword search are driven by individual words in a literal-minded way; the point of more sophisticated strategies — like topic modeling — is precisely that they pay attention to looser patterns of association in order to reflect the polysemous character of discourse, where concepts always have multiple names and words often mean several different things.

Perhaps more importantly, humanists have resigned themselves to a hermeneutically naive approach when they accept the dart-throwing game called “choosing search terms.” One of the basic premises of historicism is that other social forms are governed by categories that may not line up with our own; to understand another place or time, a scholar needs to begin by eliciting its own categories. Every time we use a search engine to do historical work we give the lie to this premise by assuming that we already know how experience is organized and labeled in, say, seventeenth-century Spain. That can be a time-consuming assumption, if our first few guesses turn out to be wrong and we have to keep throwing darts. But worse, it can be a misleading assumption, if we accept the first or second set of results and ignore concepts whose names we failed to guess. The point of more sophisticated text-mining techniques — like semantic clustering — is to allow patterns to emerge from historical collections in ways that are (if not absolutely spontaneous) at least a bit less slavishly and minutely dependent on the projection of contemporary assumptions.

I don’t want to suggest that we can dispense with search engines; when you already know what you’re looking for, and what it’s called, a naive search strategy may be the shortest path between A and B. But in the humanities you often don’t know precisely what you’re looking for yet, or what it’s called. And in those circumstances, our present search strategies are potentially misleading — although they remain powerful enough to be seductive. In short, I would suggest that humanists are choosing the wrong moment to get nervous about the distorting influence of digital methods. Crude statistical algorithms already shaped our critical practice in the 1990s when we started relying on keyword search; if we want to take back the reins, each humanist is going to need to understand text mining well enough to choose the tools appropriate for his or her own theoretical premises.

The history of an association, part two.

Post author By tedunderwood
Post date May 6, 2011
No Comments on The history of an association, part two.

Here’s another attempt to animate the history of a cluster of associated words — this time as a force-directed graph that folds and unfolds itself as the window of time moves forward, and changing strengths of association create different tensions in the graph.

I had a lot of fun making this clip, but I don’t want to make exaggerated claims for it. These images might not mean very much to me if I hadn’t also read some of the books on which they’re based. The visualization only took a day to build, though, and I think it might turn out to be a useful brainstorming tool. In this instance the clip got me thinking about the different ways time is imagined in the “terror gothic” and in the “horror gothic.”

Association between words is measured here using a vector space model and a collection of more than five hundred works of British fiction. I realize it may seem strange that associations can form and disappear while an eighty-year search window moves forward only sixty years — at the end of this clip the cluster is disappearing while the window still overlaps with the period where the cluster started to emerge. It’s worth recalling that the model isn’t counting words, but measuring the association between them. An early-eighteenth-century work that didn’t use sentimental language at all would do nothing to dilute the association between sentimental terms. But a group of nineteenth-century works that used the same language differently could rapidly obscure earlier patterns.

In short, I suspect that the language of temporal immediacy (“moment,” “suddenly,” “immediately,” and so on) is strongly associated with feeling in the 18c in part because gothic novels, and novels of sensibility, just get to it first. In the nineteenth century other kinds of fiction may take up the same temporal language, diluting its specific connection to tremulous feeling. I can’t prove it yet, but the clues I’m seeing do point in that general direction.

18c 19c visualization

The history of an association.

Post author By tedunderwood
Post date May 4, 2011
No Comments on The history of an association.

[Update May 6th, 2011: The problem I describe here is solved a bit more effectively in a more recent post.] It’s fairly easy to visualize a cluster of associated words. But I’d also like to understand how these associations change, and visualizing that is trickier. For one thing, it’s not easy to define what it means to trace “the same” cluster across time; we need an approach that remains open to the possibility that a particular set of associations could simply weaken or dissolve. The video I’ve embedded below is a first, tentative stab at the problem. Move your mouse pointer away after clicking “play” to see the image without cropping.

I’m trying to understand a late-eighteenth-century convergence between the language of temporality and of feeling. Two words that seemed particularly strongly connected were “moment” and “felt.” So what I’ve done is to proceed five years at a time through a 200-year-long corpus, looking at 80-year-long windows from the corpus. In each “snapshot,” I select the twelve words that associate most strongly in vector space with a vector that’s composed of both “moment” and “felt.” In order to graph them on a coordinate plane, I also measure their association with each term separately. The y axis is association with “moment,” and the x axis is association with “felt.” The reference terms themselves are also plotted. This gives me a way to visualize strength of association in the whole cluster — basically, as everything gets closer to the upper-right-hand corner, the strength of association is getting stronger. At the same time we can get a general sense of the semantic character of the cluster.

I’m working with a relatively small collection here — 538 works of British fiction stretched out between 1700 and 1900. I have a larger 18th-century collection, but in this case I needed continuity over a longer span of time, and in order to achieve that I had to limit the collection to fiction, which reduced its size. It also means that the selection of words you’ll see here is different from the selection of words you saw in previous posts about the “felt-moment” convergence, which were based on a generically diverse collection.

Some of the things that are awkward about this video are consequences of the small collection size. For instance, given the small collection size, I have to choose a pretty long window (80 years out of an overall 200-year-long collection). The window is a bit shorter than that at the beginning of the video — for purely dramatic reasons, so that we don’t reach the “climax” of the clip too rapidly.

Also, of course, the stop-motion animation is rather jerky. With a larger collection, I think it might actually be possible to watch these terms move across the coordinate plane in a smooth and connected fashion. But given the small collection size, smooth motion would be illusory; the data don’t really support that level of precision.

However, even with all those caveats, I feel I’m learning something from the exercise. I think we are glimpsing the transformation of an associative cluster, and looking at the way it changes across time makes me more than ever suspect that — at the moment when it’s strongest — it has something to do with the way late-eighteenth-century fiction imagines suspense. “Anxiety” and “agitation” are durable presences, often in the upper-right-hand corner of the cluster. This interpretation is also, of course, based on reading some of the relevant works, and I think the next stage in exploring the question will be to go back and read them again. As always, I’m inclined to present text-mining more as an exploratory tool or brainstorming technique than as definitive evidence.

It is also a bit interesting to watch the language of gothic agitation turn into language of middle-class striving as we get into the nineteenth century. The intersection between “moment” and “felt” is increasingly occupied not by trembling but by terms like “energy,” “effort,” and “struggle.” I’m not quite sure what to make of that trajectory. Perhaps it helps explain the dissolution of the earlier cluster.

Another way of visualizing clusters like this might be to group terms in a force-directed graph and animate the evolution of the graph across time.

math methodology

Should we model “topics” as associative clusters, or as statistical factors?

Post author By tedunderwood
Post date April 30, 2011
5 Comments on Should we model “topics” as associative clusters, or as statistical factors?

I should say up front that this is going to be a geeky post about things that happen under the hood of the car. Many readers may be better served by scrolling down to “Trends, topics, and trending topics,” which has more to say about the visible applications of text mining.

I’ve developed a clustering methodology that I like pretty well. It allows me to map patterns of usage in a large collection by treating each term as a vector; I assess how often words occur together by measuring the angle between vectors, and then group the words with Ward’s clustering method. This produces a topic tree that seems to be both legible (in the sense that most branches have obvious affinities to a genre or subject category) and surprising (in the sense that they also reveal thematic connections I wouldn’t have expected). It’s a relatively simple technique that does what I want to do, practically, as a literary historian. (You can explore this map of eighteenth-century diction to check it out yourself; and I should link once again to Sapping Attention, which convinced me clustering could be useful.)

But as I learn more about Bayesian statistics, I’m coming to realize that it’s debatable whether the clusters of terms I’m finding count as topics at all. The topic-modeling algorithms that have achieved wide acceptance (for instance, Latent Dirichlet Allocation) are based on a clear definition of what a “topic” is. They hypothesize that the observed complexity of usage patterns is actually produced by a smaller set of hidden variables. Because those variables can be represented as lists of words, they’re called topics. But the algorithm isn’t looking for thematic connections between words so much as resolving a collection into a set of components or factors that could have generated it. In this sense, it’s related to a technique like Principal Component Analysis.

Deriving those hidden variables is a mathematical task of considerable complexity. In fact, it’s impossible to derive them precisely: you have to estimate. I can only understand the math for this when my head is constantly bathed in cool running water to keep the processor from overheating, so I won’t say much more about it — except that “factoring” is just a metaphor I’m using to convey the sort of causal logic involved. The actual math involves Bayesian inference rather than algebra. But it should be clear, anyway, that this is completely different from what I’m doing. My approach isn’t based on any generative model, and can’t claim to reveal the hidden factors that produce texts. It simply clusters words that are in practice associated with each other in a corpus.

I haven’t tried the Bayesian approach yet, but it has some clear advantages. For one thing, it should work better for purposes of classification and information retrieval, because it’s looking for topics that vary (at least in principle) independently of each other.* If you want to use the presence of a topic in a document to guide classification, this matters. A topic that correlated positively or negatively with another topic would become redundant; it wouldn’t tell you much you didn’t already know. It makes sense to me that people working in library science and information retrieval have embraced an approach that resolves a collection into independent variables, because problems of document classification are central to those disciplines.

On the other hand, if you’re interested in mapping associations between terms, or topics, the clustering approach has advantages. It doesn’t assume that topics vary independently. On the contrary, it’s based on a measure of association between terms that naturally extends to become a measure of association between the topics themselves. The clustering algorithm produces a branching tree structure that highlights some of the strongest relationships and contrasts, but you don’t have to stop there: any list of terms can be treated as a vector, and compared to any other list of terms.

InstantFeelings — A fragment of a larger tree, produced by clustering the top 1650 terms in a collection of 2,193 18c documents. Click through for a larger image giving more context.

Moreover, this flexibility means that you don’t have to treat the boundaries of “topics” as fixed. For instance, here’s part of the eighteenth-century tree that I found interesting: the words on this branch seemed to imply a strange connection between temporality and feeling, and they turned out to be particularly common in late-eighteenth-century novels by female writers. Intriguing, but we’re just looking at five words. Maybe the apparent connection is a coincidence. Besides, “cried” is ambiguous; it can mean “exclaimed” in this period more often than it means “wept.” How do we know what to make of a clue like this? Well, given the nature of the vector space model that produced the tree, you can do this: treat the cluster of terms itself as a vector, and look for other terms that are strongly related to it. When I did that, I got a list that confirmed the apparent thematic connection, and helped me begin to understand it.

MomentLonger — A list of terms most strongly associated with the group "instantly, cried, felt, moment, longer" in a generically diverse corpus of 2,193 18c documents ranging from sermons to cookbooks. "Similarity" here is technically measured as the cosine of the angle between two vectors.

This is, very definitely, a list of words associated with temporality (moment, hastily, longer, instantly, recollected) and feeling (felt, regret, anxiety, astonishment, agony). Moreover, it’s fairly clear that the common principle uniting them is something like “suspense” (waiting, eagerly, impatiently, shocked, surprise). Gender is also involved — which might not have been immediately evident from the initial cluster, because gendered words were drawn even more strongly to other parts of the tree. But the associational logic of the clustering process makes it easy to treat topic boundaries as porous; the clusters that result don’t have to be treated as rigid partitions; they’re more appropriately understood as starting-places for exploration of a larger associational web.

[This would incidentally be my way of answering a valid critique of clustering — that it doesn’t handle polysemy well. The clustering algorithm has to make a choice when it encounters a word like “cried.” The word might in practice have different sets of associations, based on different uses (weep/exclaim), but it’s got to go in one branch or another. It can’t occupy multiple locations in the tree. We could try to patch that problem, but I think it may be better to realize that the problem isn’t as important as it appears, because the clusters aren’t end-points. Whether a term is obviously polysemous, or more subtly so, we’re always going to need to make a second pass where we explore the associations of the cluster itself in order to shake off the artificiality of the tree structure, and get a richer sense of multi-dimensional context. When we do that we’ll pick up words like “herself,” which could justifiably be located at any number of places in the tree.]

Much of this may already be clear to people in informatics, but I had to look at the math in order to understand that different kinds of “topic modeling” are really doing different things. Humanists are going to have some tricky choices to make here that I’m not sure we understand yet. Right now the Bayesian “factoring” approach is more prominent, partly because the people who develop text-mining algorithms tend to work in disciplines where classification problems are paramount, and where it’s important to prove that they can be solved without human supervision. For literary critics and historians, the appropriate choice is less clear. We may sometimes be interested in classifying documents (for instance, when we’re reasoning about genre), and in that case we too may need something like Latent Dirichlet Allocation or Principal Component Analysis to factor out underlying generative variables. But we’re just as often interested in thematic questions — and I think it’s possible that those questions may be more intuitively, and transparently, explored through associational clustering. To my mind, it’s fair to call both processes “topic modeling” — but they’re exploring topics of basically different kinds.

Postscript: I should acknowledge that there are lots of ways of combining these approaches, either by refining LDA itself, or by combining that sort of topic-factoring approach with an associational web. My point isn’t that we have to make a final choice between these processes; I’m just reflecting that, in principle, they do different things.

* (My limited understanding of the math behind Latent Dirichlet Allocation is based on a 2009 paper by D.M. Blei and J. D. Lafferty available here.)

18c 19c methodology ngrams topic modeling

Trends, topics, and trending topics.

Post author By tedunderwood
Post date April 19, 2011
3 Comments on Trends, topics, and trending topics.

I’ve developed a text-mining strategy that identifies what I call “trending topics” — with apologies to Twitter, where the term is used a little differently. These are diachronic patterns that I find practically useful as a literary historian, although they don’t fit very neatly into existing text-mining categories.

A “topic,” as the term is used in text-mining, is a group of words that occur together in a way that defines a thematic focus. Cameron Blevin’s analysis of Martha Ballard’s diary is often cited as an example: Blevin identifies groups of words that seem to be associated, for instance, with “midwifery,” “death,” or “gardening,” and tracks these topics over the course of the diary.

“Trends” haven’t received as much attention as topics, but we need some way to describe the pattern that Google’s ngram viewer has made so visible, where groups of related words rise and fall together across long periods of time. I suspect “trend” is as a good a name for this phenomenon as we’ll get.

2011colors — blue, red, green, yellow, in the English corpus 1750-2000

From 1750 to 1920, the prominence of color vocabulary increases by a factor of three, for instance: and when it does, the names of different colors track each other very closely. I would call this a trend. Moreover, it’s possible to extend the principle that conceptually related words rise and fall together beyond cases like the colors and seasons where we’re dealing with an obvious physical category.

AnimAttenArdour — Google data graphed with my own viewer; if you compare this to Google's viewer, remember that I'm merging capitalized and uncapitalized forms, as well as ardor/ardour.

“Animated,” “attentive,” and “ardour” track each other almost as closely as the names of primary colors (the correlation coefficients are around 0.8), and they characterize conduct in ways that are similar enough to suggest that we’re looking at the waxing and waning not just of a few random words, but of a conceptual category — say, a particular sort of interest in states of heightened receptiveness or expressivity.

I think we could learn a lot by thoughtfully considering “trends” of this sort, but it’s also a kind of evidence that’s not easy to interpret, and that could easily be abused. A lot of other words correlate almost as closely with “attentive,” including “propriety,” “elegance,” “sentiments,” “manners,” “flattering,” and “conduct.” Now, I don’t think that’s exactly a random list (these terms could all be characterized loosely as a discourse of manners), but it does cover more conceptual ground than I initially indicated by focusing on words like “animated” and “ardour.” And how do we know that any of these terms actually belonged to the same “discourse”? Perhaps the books that talked about “conduct” were careful not to talk about “ardour”! Isn’t it possible that we have several distinct discourses here that just happened to be rising and falling at the same time?

In order to answer these questions, I’ve been developing a technique that mines “trends” that are at the same time “topics.” In other words, I look for groups of terms that hold together both in the sense that they rise and fall together (correlation across time), and in the sense that they tend to be common in the same documents (co-occurrence). My way of achieving this right now is a two-stage process: first I mine loosely defined trends from the Google ngrams dataset (long lists of, say, one hundred closely correlated words), and then I send those trends to a smaller, generically diverse collection (including everything from sermons to plays) where I can break the list into clusters of terms that tend to occur in the same kinds of documents.

I do this with the same vector space model and hierarchical clustering technique I’ve been using to map eighteenth-century diction on a larger scale. It turns the list of correlated words into a large, branching tree. When you look at a single branch of that tree you’re looking at what I would call a “trending topic” — a topic that represents, not a stable, more-or-less-familiar conceptual category, but a dynamically-linked set of concepts that became prominent at the same time, and in connection with each other.

MannersTree — one branch of a tree created by finding words that correlate with "manners," and then clustering them based on co-occurrence in 18c books

Here, for instance, is a branch of a larger tree that I produced by clustering words that correlate with “manners” in the eighteenth century. It may not immediately look thematically coherent. We might have expected “manners” to be associated with words like “propriety” or “conduct” (which do in fact correlate with it over time), but when we look at terms that change in correlated ways and occur in the same volumes, we get a list of words that are largely about wealth and rank (“luxury,” “opulence,” “magnificence”), as well as the puzzling “enervated.” To understand a phenomenon like this, you can simply reverse the process that generated it, by using the list as a search query in the eighteenth-century collection it’s based on. What turned up in this case were, pre-eminently, a set of mid-eighteenth-century works debating whether modern commercial opulence, and refinements in the arts, have had an enervating effect on British manners and civic virtue. Typical examples are John Brown’s Estimate of the Manners and Principles of the Times (1757) and John Trusler’s Luxury no Political Evil but Demonstratively Proved to be Necessary to the Preservation and Prosperity of States (1781). I was dimly aware of this debate, but didn’t grasp how central it became to debate about manners, and certainly wasn’t familiar with the works by Brown and Trusler.

I feel like this technique is doing what I want it to do, practically, as a literary historian. It makes the ngram viewer something more than a provocative curiosity. If I see an interesting peak in a particular word, I can can map the broader trend of which it’s a part, and then break that trend up into intersecting discourses, or individual works and authors.

Admittedly, there’s something inelegant about the two-stage process I’m using, where I first generate a list of terms and then use a smaller collection to break the list into clusters. When I discussed the process with Ben Schmidt and Miles Efron, they both, independently, suggested that there ought to be some simpler way of distinguishing “trends” from “topics” in a single collection, perhaps by using Principal Component Analysis. I agree about that, and PCA is an intriguing suggestion. On the other hand, the two-stage process is adapted to the two kinds of collections I actually have available at the moment: on the one hand, the Google dataset, which is very large and very good at mapping trends with precision, but devoid of metadata, on the other hand smaller, richer collections that are good at modeling topics, but not large enough to produce smooth trend lines. I’m going to experiment with Principal Component Analysis and see what it can do for me, but in the meantime — speaking as a literary historian rather than a computational linguist — I’m pretty happy with this rough-and-ready way of identifying trending topics. It’s not an analytical tool: it’s just a souped-up search technology that mines trends and identifies groups of works that could help me understand them. But as a humanist, that’s exactly what I want text mining to provide.

18c methodology topic modeling

The key to all mythologies.

Post author By tedunderwood
Post date April 7, 2011
2 Comments on The key to all mythologies.

Well, not really. But it is a classifying scheme that might turn out to be as loopy as Casaubon’s incomplete project in Middlemarch, and I thought I might embrace the comparison to make clear that I welcome skepticism.

In reality, it’s just a map of eighteenth-century diction. I took the 1,650 most common words in eighteenth-century writing, and asked my iMac to group them into clusters that tend to be common in the same eighteenth-century works. Since the clustering program works recursively, you end up with a gigantic branching tree that reveals how closely words are related to each other in 18c practice. If they appear on the same “branch”; they tend to occur in the same works. If they appear on the same “twig,” that tendency is even stronger.

You wouldn’t necessarily think that two words happening to occur in the same book would tell you much, but when you’re dealing with a large number of documents, it seems there’s a lot of information contained in the differences between them. In any case, this technique produced a detailed map of eighteenth-century topics that seemed — to me, anyway — surprisingly illuminating. To explore a couple of branches, or just marvel at this monument of digital folly, click here, or on the illustration to the right. That’ll take you through to a page where you can click on whichever branches interest you. (Click on the links in the right-hand margin, not the annotations on the tree itself.) To start with, I recommend Branch 18, which is a sort of travel narrative, Branch 13, which is 18c poetic diction in a nutshell, and Branch 5, which is saying something about gender and/or sexuality that I don’t yet understand.

If you want to know exactly how this was produced, and contrast it to other kinds of topic modeling, I describe the technique in this “technical note.” I should also give thanks to the usual cast of characters. Ryan Heuser and Ben Schmidt have produced analogous structures which gave me the idea of attempting this. Laura Mandell and 18th Connect helped me obtain the eighteenth-century texts on which the tree was based.

18c methodology

Revealing the relationships between topics in a corpus.

Post author By tedunderwood
Post date April 4, 2011
No Comments on Revealing the relationships between topics in a corpus.

[UPDATE April 7: The illustrations in this post are now out of date, though some of the explanation may still be useful. The kinds of diction mapped in these illustrations are now mapped better in branches 13-14, 18, and 1 of this larger topic tree.] While trying to understand the question I posed in my last post (why did style become less “conversational” in the 18th century?), I stumbled on a technique that might be useful to other digital humanists. I thought I might pause to describe it.

The technique is basically a kind of topic modeling. But whereas most topic modeling aims to map recurring themes in a single work, this technique maps topics at the corpus level. In other words, it identifies groups of words that are linked by the fact that they tend to occur in the same kinds of books. I’m borrowing this basic idea from Ben Schmidt, who used tf-idf scores to do something similar. I’ve taken a slightly different approach by using a “vector space model,” which I prefer for reasons I’ll describe in some technical notes. But since you’ll need to see results before you care about the how, let me start by showing you what the technique produces.

BetterSentiment — part of a topic tree based on 2,200 18c works

This branch of a larger tree structure was produced by a clustering program that groups words together when they resemble each other according to some measure of similarity. In this case I defined “similarity” as a tendency to occur in the same eighteenth-century texts. Since the tree structure records the sequence of grouping operations, it can register different nested levels of similarity. In the image above, for instance, we can see that “proud” and “pride” are more likely to occur in the same texts than either is to occur together with “smile” or “gay.” But since this is just one branch of a much larger tree, all of these words are actually rather likely to occur together.

This tree is based on a generically diverse collection of 18c texts drawn from ECCO-TCP with help from 18thConnect, and was produced by applying the clustering program I wrote to the 1350 most common words in that collection. The branch shown above represents about 1/50th of the whole tree. But I’ve chosen this branch to start with because it neatly illustrates the underlying principle of association. What do these words have in common? They’re grouped together because they appear in the same kinds of texts, and it’s fairly clear that the “kinds” in this case are poetic. We could sharpen that hypothesis by using this list of words as a search query to see exactly which texts it turns up, but given the prevalence of syncope (“o’er” and “heav’n”), poetry is a safe guess.

It is true that semantically related words tend to be strongly grouped in the tree. Ease/care, charms/fair and so on, are closely linked. But that isn’t a rule built into the algorithm I’m using; the fact that it happens is telling us something about the way words are in practice distributed in the collection. As a result, you get a snapshot of eighteenth century “poetic diction,” not just in the sense of specialized words like “oft,” but in the sense that you can see which themes counted as “poetic” in the eighteenth century, and possibly gather some clues about the way those themes were divided into groups. (In order to find out whether those divisions were generic or historical, you would need to turn the process around and use the sublists as search queries.)

TravelW — part of a topic tree based on 2,200 18c works

Here’s another part of the tree, showing words that are grouped together because they tend to appear in accounts of travel. The words at the bottom of the image (from “main” to “ships”) are very clearly connected to maritime travel, and the verbs of motion at the top of the image are connected to travel more generally. It’s less obvious that diurnal rhythms like morning/evening and day/night would be described heavily in the same contexts, but apparently they are.

In trees like these, some branches are transparently related to a single genre or subject category, while others are semantically fascinating but difficult to interpret as reflections of a single genre. They may well be produced by the intersection or overlap of several different generic (or historical) categories, and it’ll require more work to understand the nature of the overlap. In a few days I’ll post an overview of the architecture of the whole 1350-word eighteenth-century tree. It’ll be interesting to see how its architecture changes when I slide the collection forward in time to cover progressively later periods (like, say, 1750-1850). But I’m finding the tree interesting for reasons that aren’t limited to big architectural questions of classification: there are interesting thematic clues at every level of the structure. Here’s a portion of one that I constructed with a slightly different list of words.

SentimentW — part of a topic tree based on 2,200 18c works

Broadly, I would say that this is the language of sentiment: “alarm,” “softened,” “shocked,” “warmest,” “unfeeling.” But there are also ringers in there, and in a way they’re the most interesting parts. For instance, why are “moment” and “instantly” part of the language of sentiment in the eighteenth century?

Tags y

18c 19c linguistics

“… a selection of the language really spoken by men”?

Post author By tedunderwood
Post date March 17, 2011
4 Comments on “… a selection of the language really spoken by men”?

William Wordsworth’s claim to have brought poetry back to “the language of conversation in the middle and lower classes of society” gets repeated to each new generation of students (1). But did early nineteenth-century writing in general become more accessible, or closer to speech? It’s hard to say. We’ve used remarks like Wordsworth’s to anchor literary history, but we haven’t had a good way to assess their representativeness.

Increasingly, though, we’re in a position to test some familiar stories about literary history — to describe how the language of one genre changed relative to others, or even relative to “the language of conversation.” We don’t have eighteenth-century English speakers to interview, but we do have evidence about the kinds of words that tend to be more common in spoken language. For instance, Laly Bar-Ilan and Ruth Berman have shown in the journal Linguistics that contemporary spoken English is distinguished from writing by containing a higher proportion of words from the Old English part of the lexicon (2). This isn’t terribly surprising, since English was for a couple of hundred years (1066-1250) almost exclusively a spoken language, while French and Latin were used for writing. Any word that entered English before this period, and survived, had to be the kind of word that gets used in conversation. Words that entered afterward were often borrowed from French or Latin to flesh out the written language.

If the spoken language was distinguished from writing this way in the thirteenth century, and the same thing holds true today, then one might expect it to hold true in the eighteenth and nineteenth centuries as well. And it does seem to hold true: eighteenth-century drama, written to be spoken on stage, is distinguished from nondramatic poetry and prose by containing a higher proportion of Old English words. This is a broad-brush approach to diction, and not one that I would use to describe individual works. But applied to an appropriately large canvas, it may give us a rough picture of how the “register” of written diction has changed across time, becoming more conversational or more formal.

This graph is based on a version of the Google English corpus that I’ve cleaned up in a number of ways. Common OCR errors involving s, f, and ct have been corrected. The graph shows the aggregate frequency of the 500 most common English words that entered the language before the twelfth century. (I’ve found date-of-entry a more useful metric of a word’s affinity with spoken language than terms like “Latinate” or “Germanic.” After all, “Latinate” words like “school,” “street,” and “wall” don’t feel learned to us, because they’ve been transmitted orally for more than a millennium.) I’ve excluded a list of stopwords that includes determiners, prepositions, pronouns, and conjunctions, as well as the auxiliary verbs “be,” “will,” and “have.”

In relative terms, the change here may not look enormous; the peak in the early eighteenth century (181 words per thousand) is only about 20% higher than the trough in the late eighteenth century (152 words per thousand). But we’re talking about some of the most common words in the language (can, think, do, self, way, need, know). It’s a bit surprising that this part of the lexicon fluctuates at all. You might expect to see a gradual decline in the frequency of these words, as the overall size of the lexicon increases. But that’s not what happens: instead we see a rapid decline in the eighteenth century (as prose becomes less like speech, or at least less like the imagined speech of contemporaneous drama), and then a gradual recovery throughout the nineteenth century.

What does this tell us about literature? Not much, without information about genre. After all, as I mentioned, dramatic writing is a lot closer to speech than, say, poetry is. This curve might just be telling us that very few plays got written in the late eighteenth century.

Fortunately it’s possible to check the Google corpus against a smaller corpus of individual texts categorized by genre. I’ve made an initial pass at the first hundred years of this problem using a corpus of 2,188 eighteenth-century books produced by ECCO-TCP, which I obtained in plain text with help from Laura Mandell and 18thConnect. Two thousand books isn’t a huge corpus, especially not after you divide them up by genre, so these results are only preliminary. But the initial results seem to confirm that the change involved the language of prose itself, and not just changes in the relative prominence of different genres. Both fiction and nonfiction prose show a marked change across the century. If I’m right that the frequency of pre-12c words is a fair proxy for resemblance to spoken language, they became less and less like speech.

“Fiction” is of course a fuzzy category in the eighteenth century. The blurriness of the boundary between a sensationalized biography and a “novel” is a lot of the point of being a novel. In the graph above, I’ve lumped biographies and collections of personal letters in with novels, because I’m less interested in distinguishing something unique about fiction than I am in confirming a broad change in the diction of nondramatic prose.

By contrast, there’s relatively little change in the diction of poetry and drama. The proportion of pre-twelfth-century words is roughly the same at the end of the century as it was at the beginning.

Are these results intuitive, or are they telling us something new? I think the general direction of these curves probably confirms some intuitions. Anyone who studies eighteenth and nineteenth-century English knows that you get a lot of long words around 1800. Sad things become melancholy, needs become a necessity, and so on.

What may not be intuitive is how broad and steady the arc of change appears to be. To the extent that we English professors have any explanation for the elegant elaboration of late-eighteenth-century prose, I think we tend to blame Samuel Johnson. But these graphs suggest that much of the change had already taken place by the time Johnson published his Dictionary. Moreover, our existing stories about the history of style put a lot of emphasis on poetry — for instance, on Wordsworth’s critique of poetic diction. But the biggest changes in the eighteenth century seem to have involved prose rather than poetry. It’ll be interesting to see whether that holds true in the nineteenth century as well.

How do we explain these changes? I’m still trying to figure out. In the next couple of weeks I’ll write a post asking what took up the slack: what kinds of language became common in books where old, common words were relatively underrepresented?

—– references —–
1) William Wordsworth and Samuel T. Coleridge, Lyrical Ballads, with a Few Other Poems (Bristol: 1798), i.
2) Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

18c 19c methodology

Using clustering to explore the relationships between trending topics.

Post author By tedunderwood
Post date February 17, 2011
2 Comments on Using clustering to explore the relationships between trending topics.

Until recently, I’ve been limited to working with tools provided by other people. But in the last month or so I realized that it’s easier than it used to be to build these things yourself, so I gave myself a crash course in MySQL and R, with a bit of guidance provided by Ben Schmidt, whose blog Sapping Attention has been a source of many good ideas. I should also credit Matt Jockers and Ryan Heuser at the Stanford Literary Lab, who are doing fabulous work on several different topics; I’m learning more from their example than I can say here, since I don’t want to describe their research in excessive detail.

I’ve now been able to download Google’s 1gram English corpus between 1700 and 1899, and have normalized it to make it more useful for exploring the 18th and 19th centuries. In particular, I normalized case and developed a way to partly correct for the common OCR errors that otherwise make the Google corpus useless in the eighteenth century: especially the substitutions s->f and ss->fl.

Having done that, I built a couple of modules that mine the dataset for patterns. Last December, I was intrigued to discover that words with close semantic relationships tend to track each other closely (using simple sensory examples like the names of colors and oppositions like hot/cold). I suspected that this pattern might extend to more abstract concepts as well, but it’s difficult to explore that hypothesis if you have to test possible instances one by one. The correlation-seeking module has made it possible to explore it more rapidly, and has also put some numbers on what was before a purely visual sense of “fittedness.”

For instance, consider “diction.” It turns out that the closest correlate to “diction” in the period 1700-1899 is “versification,” which has a Pearson correlation coefficient of 0.87. (If this graph doesn’t seem to match the Google version, remember that the ngram viewer is useless in the 18c until you correct for case and long s.)

The other words that correlate most closely with “diction” are all similarly drawn from the discourse of poetic criticism. “Poem” and “stanzas” have a coefficient of 0.82; “poetical” is 0.81. It’s a bit surprising that correlation of yearly frequencies should produce such close thematic connections. Obviously, a given subject category will be overrepresented in certain years, and underrepresented in others, so thematically related words will tend to vary together. But in a corpus as large and diverse as Google’s, one might have expected that subtle variation to be swamped by other signals. In practice it isn’t.

I’ve also built a module that looks for words that are overrepresented in a given period relative to the rest of 1700-1899. The measures of overrepresentation I’m using are a bit idiosyncratic. I’m simply comparing the mean frequency inside the period to the mean frequency outside it. I take the natural log of the absolute difference between those means, and multiply it by the ratio (frequency in the period/frequency outside it). For the moment, that formula seems to be working; I’ll try other methods (log-likelihood, etc.) later on.

Once I find a list of, say, fifty words that are overrepresented in a period, I can generate a correlation matrix based on their correlations with each other, and then do hierarchical clustering on that matrix to reveal which words track each other most closely. In effect, I create a broad list of “trending topics” in a particular period, and then use a more precise sort of curve-matching to define the relationships between those trends.

One might imagine that matching words on the basis of change-across-time would be a blunt instrument compared to a more intimate approach based on collocation in the same sentences, or at least co-occurrence in the same volumes. And for many purposes that will be true. But I’ve found that my experiments with smaller-scale co-occurrence (e.g. in MONK) often lead me into tautological dead ends. I’ll discover, e.g., that historical novels share the kind of vocabulary I sort of knew historical novels were likely to share. Relying on yearly frequency data makes it easier to avoid those dead ends, because they have the potential to turn up patterns that aren’t based purely on a single familiar genre or subject category. They may be a blunt instrument, but through their very bluntness they allow us to back up to a vantage point where it’s possible to survey phenomena that are historical rather than purely semantic.

I’ve included an example below. The clusters that emerge here are based on a kind of semantic connection, but often it’s a connection that only makes sense in the context of the period. For instance, “nitrous” and “inflammable” may seem a strange pairing, unless you know that the recently-discovered gas hydrogen was called “inflammable air,” and that chemists were breathing nitrous oxide, aka laughing gas. “Sir” and “de” may seem a strange pairing, unless you reflect that “de” is a French particle of nobility analogous to “sir,” and so on. But I also find that I’m discovering a lot here I didn’t previously know. For instance, I probably should have guessed that Petrarch was a big deal in this period, since there was a sonnet revival — but that’s not something I actually knew, and it took me a while to figure out why Petrarch was coming up. I still don’t know why he’s connected to the dramatist Charles Macklin.

There are lots of other fun pairings in there, especially britain/commerce/islands and the group of flashy hyperbolic adverbs totally/frequently/extremely connected to elegance/opulence. I’m not sure that I would claim a graph like this has much evidentiary value; clustering algorithms are sensitive to slight shifts in initial conditions, so a different list of words might produce different groupings. But I’m also not sure that evidentiary value needs to be our goal. Lately I’ve been inclined to argue that the real utility of text mining may be as a heuristic that helps us discover puzzling questions. I certainly feel that a graph like this helps me identify topics (and more subtly, kinds of periodized diction) that I didn’t recognize before, and that deserve further exploration. [UPDATE 4/20/2011: Back in February I was doing this clustering with yearly frequency data, and Pearson’s correlation, which worked surprisingly well. But I’m now fairly certain that it’s better to do it with co-occurrence data, and a vector space model. See this more recent post.]