A touching detail produced by LDA …

I’m getting ahead of myself with this post, because I don’t have time to explain everything I did to produce this. But it was just too striking not to share.

Basically, I’m experimenting with Latent Dirichlet Allocation, and I’m impressed. So first of all, thanks to Matt Jockers, Travis Brown, Neil Fraistat, and everyone else who tried to convince me that Bayesian methods are better. I’ve got to admit it. They are.

But anyway, in a class I’m teaching we’re using LDA on a generically diverse collection of 1,853 volumes between 1751 and 1903. The collection includes fiction, poetry, drama, and a limited amount of nonfiction (just biography). We’re stumbling on a lot of fascinating things, but this was slightly moving. Here’s the graph for one particular topic.

[Image of a topic.]
The circles and X’s are individual volumes. Blue is fiction, green is drama, pinkish purple is poetry, black biography. Only the volumes where this topic turned out to be prominent are plotted, because if you plot all 1,853 it’s just a blurry line at the bottom of the image. The gray line is an aggregate frequency curve, which is not related in any very intelligible way to the y-axis. (Work in progress …) As you can see, this topic is mostly prominent in fiction around the year 1800. Here are the top 50 words in the topic:


But here’s what I find slightly moving. The X’s at the top of the graph are the 10 works in the collection where the topic was most prominent. They include, in order: Mary Wollstonecraft Shelley, Frankenstein; Mary Wollstonecraft, Mary; William Godwin, St. Leon; Mary Wollstonecraft Shelley, Lodore; William Godwin, Fleetwood; William Godwin, Mandeville; and Mary Wollstonecraft Shelley, Falkner.

In short, this topic is exemplified by a family! Mary Hays does intrude into the family circle with Memoirs of Emma Courtney, but otherwise, it’s Mary Wollstonecraft, William Godwin, and their daughter.

Other critics have of course noticed that M. W. Shelley writes “Godwinian novels.” And if you go further down the list of works, the picture becomes less familial (Helen Maria Williams and Thomas Holcroft butt in, as well as P. B. Shelley). Plus, there’s another topic in the model (“myself these should situation”) that links William Godwin more closely to Charles Brockden Brown than it does to his wife or daughter. And LDA isn’t graven on stone; every time you run topic modeling you’re going to get something slightly different. But still, this is kind of a cool one. “Mind feelings heart felt” indeed.

Etymology and nineteenth-century poetic diction; or, singing the shadow of the bitter old sea.

In a couple of recent posts, I argued that fiction and poetry became less similar to nonfiction prose over the period 1700-1900. But because I only measured genres’ distance from each other, I couldn’t say much substantively about the direction of change. Toward the end of the second post, though, I did include a graph that hinted at a possible cause:


The older part of the lexicon (mostly words derived from Old English) gradually became more common in poetry, fiction, and drama than in nonfiction prose. This may not be the only reason for growing differentiation between literary and nonliterary language, but it seems worth exploring. (I should note that function words are excluded from this calculation for reasons explained below; we’re talking about verbs, nouns, and adjectives — not about a rising frequency of “the.”)

Why would genres become etymologically different? Well, it appears that words of different origins are associated in contemporary English with different registers (varieties of language appropriate for a particular social situation). Words of Old English provenance get used more often in speech than in writing — and in writing they are (now) used more often in narrative than in exposition. Moreover, writers learn to produce this distinction as they get older; there isn’t a marked difference for students in elementary school. But as they advance to high school, students learn to use Latinate words in formal expository writing (Bar-Ilan and Berman, 2007).

It’s not hard to see why words of Old English origin might be associated with spoken language. For nearly 200 years (1066-1250), English was almost exclusively spoken. The learned part of the Old English lexicon didn’t survive this period. Instead, when English began to be used again in writing, literate vocabulary was borrowed from French and Latin. As a result, etymological distinctions in English tend also to be distinctions between different social contexts of language use.

Instead of distinguishing “Germanic” and “Latinate” diction here, I have used the first attested date for each word, choosing 1150 as a dividing line because it’s the midpoint of the period when English was not used in writing. Of course pre-1150 words are mostly from Old English, but I prefer to divide based on date-of-entry because that highlights the history of writing rather than a spurious ethnic mystique. (E.g., “Anglo-Saxon is a livelier tongue than Latin, so use Anglo-Saxon words.” — E. B. White.) But the difference isn’t material. You could even just measure the average length of words in different genres and get results that are close to the results I’m graphing here (the correlation between the pre/post-1150 ratio and average word length is often -.85 or lower).

The bottom line is this: using fewer pre-1150 words tends to make diction more overtly literate or learned. Using more of them makes diction less overtly learned, and perhaps closer to speech. It would be dangerous to assume much more: people may think that Old English words are “concrete” — but this isn’t true, for instance, of “word” or “true.”

What can we learn by graphing this aspect of diction?


In the period 1700-1900, I think we learn three interesting things:

1) All genres of writing (or at least of prose) seem to acquire an exaggeratedly “literate” diction in the course of the eighteenth century.

2) Poetry and fiction reverse that process in the nineteenth century, and develop a diction that is markedly less learned than other kinds of writing, or than their own past history.

3) But they do that to different degrees, and as a result the overall story is one of increasing differentiation, not just between “literary” and “nonliterary” diction, but between poetry and fiction as well.

I’m fascinated by this picture. It suggests that the difference linguists have observed between the registers of exposition and narrative may be a relatively recent development. It also raises interesting questions about “literariness” in the eighteenth and nineteenth centuries. For instance, contrast this picture to the standard story where “poetic diction” is an eighteenth-century refinement that the nineteenth century learns to dispense with. Where the etymological dimension of diction is concerned, that story doesn’t fit the evidence. On the contrary, nineteenth-century poetry differentiates itself from the diction of prose in a new and radical way: by the end of the century, the older part of the lexicon has become more than 2.5 times more prominent, on average, in verse than it is in nonfiction prose.

I could speculate about why this happened, but I don’t really know yet. What I can do is give a little more descriptive detail. For instance, if pre-1150 words became more common in 19c poetry … which words, exactly, were involved? One way to approach that is to ask which individual words correlate most strongly with the pre/post-1150 ratio. We might focus especially, for instance, on the rising trend in poetry from the middle of the eighteenth century to 1900. If you sort the top 10,000 words in the poetry collection by correlation with yearly values of the pre/post ratio, you get a list like this:


But the precise correlation coefficients don’t matter as much as an overall picture of diction, so I’ll simply list the hundred words that correlate most strongly with the pre/post-1150 ratio in poetry from 1755 to 1900:


We’re looking mostly at a list of pre-1150 words, with a few exceptions (“face,” “flower,” “surely”). That’s not an inevitable result; if the etymological trend had been a side-effect of something mostly unrelated to linguistic register (say, a vogue for devotional poetry), then sorting the top 10,000 words by correlation with the trend would reveal a list of words associated with its underlying (religious) cause. But instead we’re seeing a trend that seems to have a coherent sociolinguistic character. That’s not just a feature of the top 100 words: the average pre-1150 word is located 2210 places higher on this list than the average post-1150 word.
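For readers who want to try this kind of ranking themselves, here is a minimal Python sketch. It is not the code behind the lists in this post. It assumes Pearson correlation (the post doesn't say which coefficient was used), and the yearly frequencies and ratio values are made-up toy numbers. The names `rank_by_trend` and `mean_position` are hypothetical helpers invented for the example.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_by_trend(word_freqs, ratio_by_year):
    """Sort words by how strongly their yearly frequency tracks the ratio."""
    years = sorted(ratio_by_year)
    trend = [ratio_by_year[y] for y in years]
    scored = [(pearson([freqs.get(y, 0.0) for y in years], trend), word)
              for word, freqs in word_freqs.items()]
    scored.sort(reverse=True)  # strongest positive correlation first
    return [(word, r) for r, word in scored]


def mean_position(ranked, words):
    """Average 1-based list position of the given words."""
    positions = [i for i, (w, _) in enumerate(ranked, start=1) if w in words]
    return sum(positions) / len(positions)


# Toy, made-up numbers purely for illustration (not from the collection).
ratio_by_year = {1760: 0.8, 1800: 0.9, 1840: 1.1, 1880: 1.3}
word_freqs = {
    "sea":   {1760: 1.0, 1800: 1.2, 1840: 1.9, 1880: 2.4},
    "night": {1760: 2.0, 1800: 2.1, 1840: 2.6, 1880: 3.0},
    "taste": {1760: 1.5, 1800: 1.4, 1840: 0.9, 1880: 0.6},
    "pomp":  {1760: 0.9, 1800: 0.7, 1840: 0.3, 1880: 0.1},
}
ranked = rank_by_trend(word_freqs, ratio_by_year)
older, newer = {"sea", "night"}, {"taste", "pomp"}
gap = mean_position(ranked, newer) - mean_position(ranked, older)
```

In the toy data, `gap` plays the role of the "2210 places" figure above: it measures how much higher, on average, the older words sit on the sorted list.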

It’s not, however, simply a list of common Anglo-Saxon words. The list clearly reflects a particular model of “poetic diction,” although the nature of that model is not easy to describe. It involves an odd mixture of nouns for large natural phenomena (wind, sea, rain, water, moon, sun, star, stars, sunset, sunrise, dawn, morning, days, night, nights) and verbs that express a subjective relation (sang, laughed, dreamed, seeing, kiss, kissed, heard, looked, loving, stricken). [Afterthought: I don’t think we have any Hopkins in our collection, but it sounds like my computer is parodying Gerard Manley Hopkins.] There’s also a bit of explicitly archaic Wardour Street in there (yea, nay, wherein, thereon, fro).

Here, by contrast, are the words at the bottom of the list — the ones that correlate negatively with the pre/post-1150 trend, because they are less common, on average, in years where that trend spikes.


There’s a lot that could be said about this list, but one thing that leaps out is an emphasis on social competition. Pomp, power, superior, powers, boast, bestow, applause, grandeur, taste, pride, refined, rival, fortune, display, genius, merit, talents. This is the language of poems that are not bashful about acknowledging the relationship between “social” distinction and the “arts” “inspired” by the “muse” — a theme entirely missing (or at any rate disavowed) in the other list. So we’re getting a fairly clear picture of a thematic transformation in the concerns of poetry from 1755 to 1900. But these lists are generated strictly by correlation with the unsmoothed year-to-year variation of an etymological trend! Moreover, the lists have themselves an etymological character. There are admittedly a few pre-1150 words in this list of negative correlators (mind, oft, every), but by and large it’s a list of words derived from French or Latin after 1150.

I think the apparent connection between sociolinguistic and thematic issues is the really interesting part of this. It begins to hint that the broader shift in poetic diction (using words from the older part of the lexicon to differentiate poetry from prose) had itself an unacknowledged social rationale — which was to disavow poetry’s connection to cultural distinction, and foreground instead a simplified individual subjectivity. I’m admittedly speculating a little here, and there’s a great deal more that could be said both about poetry and about parallel trends in fiction — but I’ve said enough for one blog post.

A couple of quick final notes. You’re wondering, what about drama?


Our collection of drama in the nineteenth century is actually too sparse to draw any conclusions yet, but there’s the trend line so far, if you’re interested.

You’re also wondering, how were pre- and post-1150 words actually sorted out? I made a list of the 10,500 most common words in the collection, and mined etymologies for them using a web-crawler on Dictionary.com. I excluded proper nouns, abbreviations, and words that entered English after 1699. I also excluded function words (determiners, prepositions, conjunctions, pronouns, and the verb to be) because as Bar-Ilan and Berman say, “register variation is essentially a matter of choice — of selecting high-level more formal alternatives instead of everyday, colloquial items or vice versa” (15). There is generally no alternative to prepositions, pronouns, etc, so they don’t tell us much about choice. After those exclusions, I had a list of 9,517 words, of which 2,212 entered the language before 1150 and 7,125 after 1149. (The list is available here.)
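The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used: it assumes the ratio is computed over word tokens, and the attestation years and stopword list in the example are placeholders rather than real dictionary data. `pre_post_ratio` is a hypothetical name.

```python
def pre_post_ratio(tokens, first_attested, stopwords=frozenset(),
                   cutoff=1150, latest=1699):
    """Ratio of pre-cutoff to post-cutoff word tokens.

    first_attested maps a lowercase word to the year it entered English.
    Words missing from that mapping, function words, and words that
    entered the language after `latest` are skipped.
    """
    pre = post = 0
    for token in tokens:
        word = token.lower()
        if word in stopwords:
            continue
        year = first_attested.get(word)
        if year is None or year > latest:
            continue
        if year < cutoff:
            pre += 1
        else:
            post += 1
    if post == 0:
        raise ValueError("no post-cutoff words in this sample")
    return pre / post


# Placeholder years: only the side of 1150 matters here, not the exact date.
first_attested = {"sea": 900, "sang": 900, "night": 900,
                  "pomp": 1300, "grandeur": 1600, "telegraph": 1790}
stopwords = frozenset({"the", "of", "and", "at", "by"})
tokens = "The sea sang of pomp and grandeur at night by telegraph".split()
ratio = pre_post_ratio(tokens, first_attested, stopwords)
```

Here three tokens count as pre-1150 (sea, sang, night) and two as post-1150 (pomp, grandeur), while "telegraph" is dropped because its placeholder date falls after 1699.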

Finally, I doubt we’ll be so lucky — but if you do cite this blog post, it should be cited as a collective work by Ted Underwood and Jordan Sellers, because the nineteenth-century part of the underlying collection is a product of Jordan’s research.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

[UPDATE March 13, 2011: Twitter conversation about this post with Natalia Cecire.]

Big but not distant.

Big data. I’m tempted to begin “I, too, dislike it,” because the phrase has become a buzzword. To mainstream humanists, it sounds like a perversion. Even people who work in digital humanities protest that DH shouldn’t be normatively identified with big data — and I agree — so generally I keep quiet on the whole vexed question.

Except … there are a lot of grad students out there just starting to look at DH curiously, wondering whether it offers anything useful for their own subfield. In that situation, it’s natural to start by building a small collection that addresses a specific research problem you know about. And that might, in many cases, be a fine approach! But my conscience is nagging at me, because I can see some other, less obvious opportunities that students ought to be informed about.

It’s true that DH doesn’t have to be identified with scale. But the fact remains that problems of scale constitute a huge blind spot for individual researchers, and also define a problem that we know computers can help us explore. And when you first go into an area that was a blind spot for earlier generations of scholars, you’re almost guaranteed to find research opportunities — lying out on the ground like lumps of gold you don’t have to mine.

I’m just saying.


This suggests that it might be a mistake to assume that the most cost-effective way to get started in DH is to define a small collection focused on a particular problem you know about. It might actually be a better strategy to beg, borrow, or steal a large collection — and poke around in it for problems we don’t yet know about.

“But I’m not interested in big statistical generalizations; I care about describing individual works, decades, and social problems.” I understand; that’s a valid goal; but it’s not incompatible with the approach I’m recommending. I think it’s really vital that we do a better job of distinguishing “big data” (the resource) from “distant reading” (a particular interpretive strategy).* Big data doesn’t have to produce distant generalizations; we can use the leverage provided by scale and comparative analysis to crack open small and tightly-focused questions.

I don’t think most humanists have an intuitive grasp of how that “leverage” would work — but topic modeling is a good example. As I play around with topic-modeling large collections, I’m often finding that the process tells me interesting things about particular periods, genres, or works, by revealing how they differ from other relevant points of comparison. Topic modeling doesn’t use scale to identify a “trend” or an “average,” after all; what it does is identify the most salient dimensions of difference in a given collection. If you believe that the significance of a text is defined by its relation to context, then you can see how topic modeling a collection might help us crack open the (relational) significance of individual works.

“But how do we get our hands on the data?” Indeed: there’s the rub. Miriam Posner has recently suggested that the culture surrounding “coding” serves as a barrier that discourages women and minorities from entering certain precincts of DH. I think that’s right, but I’m even more concerned about the barriers embodied in access to data. Coding is actually not all that hard to pick up. Yes, it’s surrounded by gendered assumptions; but still, you can do it over a summer. [Update: Or, where that’s not practical, you can collaborate with someone. At Illinois, Loretta Auvil and Boris Capitanu do kinds of DH programming that are beyond me. I don’t mean to minimize issues of gender here, but I do mean to put “coding” in perspective. It’s not a mysterious, magical key.] By contrast, none of us can build big data on our own (or even in small teams) over the summer. If we don’t watch out, our field could easily slip into a situation where power gravitates to established scholars at large/wealthy research universities.

I’ve tried to address that by making my own data public. I haven’t documented it very well yet, but give me a few weeks. I think peer pressure should be exerted on everyone (especially established scholars) to make their data public at the time of publication. I do understand that some kinds of data can’t be shared because they’re owned by private enterprise. I accept that. But if you’ve supplemented proprietary data with other things you’ve produced on your own: in my opinion, that data should be made public at the time of publication.

Moreover, if you do that, I’m not going to care very much about the mistakes you have made in building your collection. I may think your data is completely biased and unrepresentative, because it includes too much Y and not enough X. But if so, I have an easy solution — which is to take your data, add it to my own collection of X, and other data borrowed from Initiative Z, and then select whatever subset would in my opinion create a balanced and representative collection. Then I can publish my own article correcting your initial, biased result.

Humanists are used to approaching debates about historical representation as if they were zero-sum questions. I suppose we are on some level still imagining this as a debate about canonicity — which is, as John Guillory pointed out, really a debate about space on the syllabus. Space on the syllabus is a zero-sum game. But the process of building big data is not zero-sum; it is cumulative. Every single thing you digitize is more good news for me, even if I shudder at the tired 2007-vintage assumptions implicit in your research agenda.

Personally, I feel the same way about questions of markup and interoperability. It’s all good. If you can give me clean** ASCII text files with minimal metadata, I love you. If you can give me TEI with enriched metadata, I love you. I don’t want to waste a lot of breath arguing about which standard is better. In most cases, clean ASCII text would be a lot better than what I can currently get.

* I hasten to say that I’m using “distant reading” here as the phrase is commonly deployed in debate — not as Franco Moretti originally used it — because the limitation I’m playing on is not really present in Moretti’s own use of the term. Moretti pointedly emphasizes that the advantage of a distant perspective may be to reveal the relational significance of an individual work.

** And, when I say “clean” — I will definitely settle for a 5% error rate.

References
Guillory, John. Cultural Capital. Chicago: U. of Chicago Press, 1993.
Moretti, Franco. Graphs, Maps, Trees. New York: Verso, 2005.

[UPDATE: For a different perspective on the question of representativeness, see Katherine D. Harris on Big Data, DH, and Gender. Also, see Roger Whitson, who suggests that linked open data may help us address issues of representation.]

Literary and nonliterary diction, the sequel.

In my last post, I suggested that literary and nonliterary diction seem to have substantially diverged over the course of the eighteenth and nineteenth centuries. The vocabulary of fiction, for instance, becomes less like nonfiction prose at the same time as it becomes more like poetry.

It’s impossible to interpret a comparative result like this purely as evidence about one side of the comparison. We’re looking at a process of differentiation that involves changes on both sides: the language of nonfiction and fiction, for instance, may both have specialized in different ways.

This post is partly a response to very helpful suggestions I received from commenters, both on this blog and at Language Log. It’s especially a response to Ben Schmidt’s effort to reproduce my results using the Bookworm dataset. I also try two new measures of similarity toward the end of the post (cosine similarity and etymology) which I think interestingly sharpen the original hypothesis.

I have improved my number-crunching in four main ways (you can skip these if you’re bored):

1) In order to normalize corpus size across time, I’m now comparing equal-sized samples. Because the sample sizes are small relative to the larger collection, I have been repeating the sampling process five times and averaging results with Fisher’s r-to-z transform. Repeated sampling doesn’t make a huge difference, but it slightly reduces noise.

2) My original blog post used 39-year slices of time that overlapped with each other, producing a smoothing effect. Ben Schmidt persuasively suggests that it would be better to use non-overlapping samples, so in this post I’m using non-overlapping 20-year slices of time.

3) I’m now running comparisons on the top 5,000 words in each pair of samples, rather than the top 5,000 words in the collection as a whole. This is a crucial and substantive change.

4) Instead of plotting a genre’s similarity to itself as a flat line of perfect similarity at the top of each plot, I plot self-similarity between two non-overlapping samples selected randomly from that genre. (Nick Lamb at Language Log recommended this approach.) This allows us to measure the internal homogeneity of a genre and use it as a control for the differentiation between genres.
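The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the R code used for the plots. It assumes that samples are drawn as random word tokens and that "top words in each pair of samples" means the most frequent words in the two samples combined. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pair_similarity(tokens_a, tokens_b, top_n=5000):
    """Correlate word counts over the top_n words of this pair of samples."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = [w for w, _ in (counts_a + counts_b).most_common(top_n)]
    return spearman([counts_a[w] for w in vocab], [counts_b[w] for w in vocab])


def fisher_average(rs):
    """Average correlations via r-to-z: atanh each r, take the mean, tanh back."""
    clipped = [max(-0.999999, min(0.999999, r)) for r in rs]  # atanh(1) is undefined
    return math.tanh(sum(math.atanh(r) for r in clipped) / len(clipped))


def genre_similarity(genre_a, genre_b, sample_size, repeats=5, seed=0):
    """Compare equal-sized random samples several times and average the results."""
    rng = random.Random(seed)
    rs = []
    for _ in range(repeats):
        rs.append(pair_similarity(rng.sample(genre_a, sample_size),
                                  rng.sample(genre_b, sample_size)))
    return fisher_average(rs)


def self_similarity(genre, sample_size, repeats=5, seed=0):
    """Control: compare two non-overlapping samples drawn from one genre."""
    rng = random.Random(seed)
    rs = []
    for _ in range(repeats):
        drawn = rng.sample(genre, 2 * sample_size)
        rs.append(pair_similarity(drawn[:sample_size], drawn[sample_size:]))
    return fisher_average(rs)
```

Averaging in z-space is not the same as taking an arithmetic mean: `fisher_average([0.0, 0.8])` comes out at about 0.5 rather than 0.4, because the transform stretches correlations near 1 before they are averaged.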

Briefly, I think the central claims I was making in my original post hold up. But the constraints imposed by this newly-rigorous methodology have forced me to focus on nonfiction, fiction, and poetry. Our collections of biography and drama simply aren’t large enough yet to support equal-sized random samples across the whole period.

Here are the results for fiction compared to nonfiction, and nonfiction compared to itself.


This strongly supports the conclusion that fiction was becoming less like nonfiction, but also reveals that the internal homogeneity of the nonfiction corpus was decreasing, especially in the 18c. So some of the differentiation between fiction and nonfiction may be due to the internal diversification of nonfiction prose.

By contrast, here are the results for poetry compared to fiction, and fiction compared to itself.

Poetry and fiction are becoming more similar in the period 1720-1900. I should note that I’ve dropped the first datapoint, for the period 1700-1719, because it seemed to be an outlier. Also, we’re using a smaller sample size here, because my poetry collection won’t support 1-million-word samples across the whole period. (We have stripped the prose introductions and notes from volumes of poetry, so they’re small.)

Another question that was raised, both by Ben and by Mark Liberman at Language Log, involved the relationship between “diction” and “topical content.” The Spearman correlation coefficient gives common and uncommon words equal weight, which means (in effect) that it makes no effort to distinguish style from content.

But there are other ways of contrasting diction. And I thought I might try them, because I wanted to figure out how much of the growing distance between fiction and nonfiction was due simply to the topical differentiation of nonfiction in this period. So in the next graph, I’m comparing the cosine similarity of million-word samples selected from fiction and nonfiction to distinct samples selected from nonfiction. Cosine similarity is a measure that, in effect, gives more weight to common words.


I was surprised by this result. When I get very stable numbers for any variable I usually assume that something is broken. But I ran this twice, and used the same code to make different comparisons, and the upshot is that samples of nonfiction really are very similar to other samples of nonfiction in the same period (as measured by cosine similarity). I assume this is because the growing topical heterogeneity that becomes visible in Spearman’s correlation makes less difference to a measure that focuses on common words. Fiction is much more diverse internally by this measure — which makes sense, frankly, because the most common words can be totally different in first-person and third-person fiction. But — to return to the theme of this post — the key thing is that there’s a dramatic differentiation of fiction and nonfiction in this period. Here, by contrast, are the results for nonfiction and poetry compared to fiction, as well as fiction compared to itself.

This graph is a little wriggly, and the underlying data points are pretty bouncy — because fiction is internally diverse when measured by cosine similarity, and it makes a rather bouncy reference point. But through all of that I think one key fact does emerge: by this measure, fiction looks more similar to nonfiction prose in the eighteenth century, and more similar to poetry in the nineteenth.
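For anyone unfamiliar with the measure, here is a minimal Python sketch of cosine similarity on raw word counts. It is only an illustration with made-up toy samples, not the code behind these graphs, and `cosine_similarity` is a hypothetical helper name. Because each word contributes the product of its two counts, frequent words dominate the score.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two word-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy samples: identical common words, completely different rare words.
sample_a = ["the"] * 100 + ["of"] * 50 + ["whale"] * 2
sample_b = ["the"] * 100 + ["of"] * 50 + ["treaty"] * 2
similarity = cosine_similarity(sample_a, sample_b)
```

The two toy samples share no topical words at all, yet their cosine similarity is above 0.99, which is the sense in which this measure mostly registers agreement on common words. A rank-based measure like Spearman’s correlation would treat the disagreement on "whale" and "treaty" as just as important as the agreement on "the."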

There’s a lot more to investigate here. In my original post I tried to identify some of the words that became more common in fiction as it became less like nonfiction. I’d like to run that again, in order to explain why fiction and poetry became more similar to each other. But I’ll save that for another day. I do want to offer one specific metric that might help us explain the differentiation of “literary” and “nonliterary” diction: the changing etymological character of the vocabulary in these genres.


Measuring the ratio of “pre-1150” to “post-1150” words is roughly like measuring the ratio of “Germanic” to “Latinate” diction, except that there are a number of pre-1150 words (like “school” and “wall”) that are technically “Latinate.” So this is essentially a way of measuring the relative “familiarity” or “informality” of a genre (Bar-Ilan and Berman 2007). (This graph is based on the top 10k words in the whole collection. I have excluded proper nouns, words that entered the language after 1699, and stopwords — determiners, pronouns, conjunctions, and prepositions.)

I think this graph may help explain why we have the impression that literary language became less specialized in this period. It may indeed have become more informal — perhaps even closer to the spoken language. But in doing so it became more distinct from other kinds of writing.

I’d like to thank everyone who responded to the original post: I got a lot of good ideas for collection development as well as new ways of slicing the collection. Katherine Harris, for instance, has convinced me to add more women writers to the collection; I’m hoping that I can get texts from the Brown Women Writers Project. This may also be a good moment to reiterate that the nineteenth-century part of the collection I’m working with was selected by Jordan Sellers, and these results should be understood as built on his research. Finally, I have put the R code that I used for most of these plots on my Open Data page, but it’s ugly and not commented yet; prettier code will appear later this weekend.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.