
Literary and nonliterary diction, the sequel.

In my last post, I suggested that literary and nonliterary diction seem to have substantially diverged over the course of the eighteenth and nineteenth centuries. The vocabulary of fiction, for instance, becomes less like nonfiction prose at the same time as it becomes more like poetry.

It’s impossible to interpret a comparative result like this purely as evidence about one side of the comparison. We’re looking at a process of differentiation that involves changes on both sides: the language of nonfiction and fiction, for instance, may both have specialized in different ways.

This post is partly a response to very helpful suggestions I received from commenters, both on this blog and at Language Log. It’s especially a response to Ben Schmidt’s effort to reproduce my results using the Bookworm dataset. I also try two new measures of similarity toward the end of the post (cosine similarity and etymology) which I think interestingly sharpen the original hypothesis.

I have improved my number-crunching in four main ways (you can skip these if you’re bored):

1) In order to normalize corpus size across time, I’m now comparing equal-sized samples. Because the sample sizes are small relative to the larger collection, I have been repeating the sampling process five times and averaging results with a Fisher’s r-to-z transform. Repeated sampling doesn’t make a huge difference, but it slightly reduces noise.

2) My original blog post used 39-year slices of time that overlapped with each other, producing a smoothing effect. Ben Schmidt persuasively suggests that it would be better to use non-overlapping samples, so in this post I’m using non-overlapping 20-year slices of time.

3) I’m now running comparisons on the top 5,000 words in each pair of samples, rather than the top 5,000 words in the collection as a whole. This is a crucial and substantive change.

4) Instead of plotting a genre’s similarity to itself as a flat line of perfect similarity at the top of each plot, I plot self-similarity between two non-overlapping samples selected randomly from that genre. (Nick Lamb at Language Log recommended this approach.) This allows us to measure the internal homogeneity of a genre and use it as a control for the differentiation between genres.
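To make step 1 concrete: averaging correlations with Fisher's r-to-z transform looks something like the following. (My actual code is in R; this is a minimal Python sketch, and the function name is my own.)

```python
import math

def average_correlations(rs):
    """Average correlation coefficients via Fisher's r-to-z transform.

    Correlations aren't additive on the r scale, so each r is mapped to
    z = arctanh(r), the z values are averaged, and the mean is mapped
    back with tanh."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# e.g., five repeated samples of the same genre comparison:
averaged = average_correlations([0.81, 0.79, 0.84, 0.80, 0.82])
```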
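And to illustrate step 3, here's a rough sketch of what "the top 5,000 words in each pair of samples" means. (Spearman's correlation is hand-rolled here to keep the sketch self-contained; again, the real analysis is in R.)

```python
from collections import Counter

def _ranks(xs):
    """Rank a list, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def pairwise_spearman(counts_a, counts_b, n=5000):
    """Correlate word frequencies over the top-n words of this *pair*
    of samples, not the top-n words of the whole collection."""
    combined = Counter(counts_a) + Counter(counts_b)
    vocab = [w for w, _ in combined.most_common(n)]
    a = [counts_a.get(w, 0) for w in vocab]
    b = [counts_b.get(w, 0) for w in vocab]
    return spearman(a, b)
```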
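Finally, step 4 in sketch form: drawing two non-overlapping random samples from a single genre. (The segment sizes here are hypothetical; in practice a "segment" is a volume or a chunk of a long volume.)

```python
import random

def two_disjoint_samples(segments, sample_size, seed=0):
    """Draw two non-overlapping random samples from one genre.

    `segments` is a list of (segment_id, word_count) pairs. Comparing the
    two samples to each other measures the genre's internal homogeneity,
    which serves as a control for cross-genre differentiation."""
    pool = list(segments)
    random.Random(seed).shuffle(pool)
    sample_a, sample_b = [], []
    words_a = words_b = 0
    for seg_id, n_words in pool:
        if words_a < sample_size:
            sample_a.append(seg_id)
            words_a += n_words
        elif words_b < sample_size:
            sample_b.append(seg_id)
            words_b += n_words
    return sample_a, sample_b
```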

Briefly, I think the central claims I was making in my original post hold up. But the constraints imposed by this newly rigorous methodology have forced me to focus on nonfiction, fiction, and poetry. Our collections of biography and drama simply aren’t large enough yet to support equal-sized random samples across the whole period.

Here are the results for fiction compared to nonfiction, and nonfiction compared to itself.

This strongly supports the conclusion that fiction was becoming less like nonfiction, but also reveals that the internal homogeneity of the nonfiction corpus was decreasing, especially in the 18c. So some of the differentiation between fiction and nonfiction may be due to the internal diversification of nonfiction prose.

By contrast, here are the results for poetry compared to fiction, and fiction compared to itself.

Poetry and fiction are becoming more similar in the period 1720-1900. I should note that I’ve dropped the first datapoint, for the period 1700-1719, because it seemed to be an outlier. Also, we’re using a smaller sample size here, because my poetry collection won’t support 1 million word samples across the whole period. (We have stripped the prose introduction and notes from volumes of poetry, so they’re small.)

Another question that was raised, both by Ben and by Mark Liberman at Language Log, involved the relationship between “diction” and “topical content.” The Spearman correlation coefficient gives common and uncommon words equal weight, which means (in effect) that it makes no effort to distinguish style from content.

But there are other ways of contrasting diction. And I thought I might try them, because I wanted to figure out how much of the growing distance between fiction and nonfiction was due simply to the topical differentiation of nonfiction in this period. So in the next graph, I’m comparing the cosine similarity of million-word samples selected from fiction and nonfiction to distinct samples selected from nonfiction. Cosine similarity is a measure that, in effect, gives more weight to common words.
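To make the difference in emphasis concrete, cosine similarity on raw count vectors looks like this (a generic sketch, not my actual R code):

```python
import math

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count dictionaries.

    Counts enter the dot product directly, so very frequent words dominate
    the result: roughly the opposite emphasis from a rank-based measure
    like Spearman's, which weights common and uncommon words equally."""
    dot = sum(c * counts_b.get(w, 0) for w, c in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)
```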

I was surprised by this result. When I get very stable numbers for any variable I usually assume that something is broken. But I ran this twice, and used the same code to make different comparisons, and the upshot is that samples of nonfiction really are very similar to other samples of nonfiction in the same period (as measured by cosine similarity). I assume this is because the growing topical heterogeneity that becomes visible in Spearman’s correlation makes less difference to a measure that focuses on common words. Fiction is much more diverse internally by this measure — which makes sense, frankly, because the most common words can be totally different in first-person and third-person fiction. But — to return to the theme of this post — the key thing is that there’s a dramatic differentiation of fiction and nonfiction in this period. Here, by contrast, are the results for nonfiction and poetry compared to fiction, as well as fiction compared to itself.

This graph is a little wriggly, and the underlying data points are pretty bouncy — because fiction is internally diverse when measured by cosine similarity, and it makes a rather bouncy reference point. But through all of that I think one key fact does emerge: by this measure, fiction looks more similar to nonfiction prose in the eighteenth century, and more similar to poetry in the nineteenth.

There’s a lot more to investigate here. In my original post I tried to identify some of the words that became more common in fiction as it became less like nonfiction. I’d like to run that again, in order to explain why fiction and poetry became more similar to each other. But I’ll save that for another day. I do want to offer one specific metric that might help us explain the differentiation of “literary” and “nonliterary” diction: the changing etymological character of the vocabulary in these genres.

Measuring the ratio of “pre-1150” to “post-1150” words is roughly like measuring the ratio of “Germanic” to “Latinate” diction, except that there are a number of pre-1150 words (like “school” and “wall”) that are technically “Latinate.” So this is essentially a way of measuring the relative “familiarity” or “informality” of a genre (Bar-Ilan and Berman 2007). (This graph is based on the top 10k words in the whole collection. I have excluded proper nouns, words that entered the language after 1699, and stopwords — determiners, pronouns, conjunctions, and prepositions.)
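In sketch form, the measure works something like this. (The attestation dates and stopword list below are hypothetical stand-ins for the real etymological data, and the function name is my own.)

```python
def pre_1150_ratio(counts, entry_dates, stopwords):
    """Share of counted word tokens first attested before 1150.

    Words with no known date, words entering the language after 1699, and
    stopwords are excluded, as described above."""
    pre = post = 0
    for word, n in counts.items():
        year = entry_dates.get(word)
        if year is None or year > 1699 or word in stopwords:
            continue
        if year < 1150:
            pre += n
        else:
            post += n
    return pre / (pre + post)
```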

I think this graph may help explain why we have the impression that literary language became less specialized in this period. It may indeed have become more informal — perhaps even closer to the spoken language. But in doing so it became more distinct from other kinds of writing.

I’d like to thank everyone who responded to the original post: I got a lot of good ideas for collection development as well as new ways of slicing the collection. Katherine Harris, for instance, has convinced me to add more women writers to the collection; I’m hoping that I can get texts from the Brown Women Writers Project. This may also be a good moment to reiterate that the nineteenth-century part of the collection I’m working with was selected by Jordan Sellers, and these results should be understood as built on his research. Finally, I have put the R code that I used for most of these plots in my Open Data page, but it’s ugly and not commented yet; prettier code will appear later this weekend.

Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

18 replies on “Literary and nonliterary diction, the sequel.”

This is really fascinating stuff. Would it be too reductive to say you’re describing a shift in the set from ‘prose/poetry’ as the major binary towards ‘factual/literary,’ as fiction changes its allegiance from nonfiction to poetry?

I very much like the Latinate/Germanic divide as an explanation, too. Do you think those same changes are driving the Spearman differences as well, or is there a different type of vocabulary shift that could give the macro explanation?

One thing I’d love to see clarified about the code is that it’s defined in terms of ‘segments’: what’s the relation between a segment and a book? As Natalie Houston said last time, different sizes across genres will cloud things up a bit: the distance between 2 corpora of 10 books @ 100,000 words each will be less than that between 2 corpora of 1 book @ 1,000,000 words each, though I doubt the effect is anywhere near as large as the effect of corpus size.

Re: your characterization of the shift … I don’t know yet whether that works. It might work, and it interestingly parallels what W. Wordsworth has to say in his “Preface” to Lyrical Ballads — that the opposite of poetry is “not prose, but matter of fact or science.” But I also see a lot of other ways we could characterize this, and I don’t know how to decide yet.

I’ll try to figure out how much correlation there is between the Germanic/Latinate ratio and the Spearman shift. But candidly, I expected to see a lot of Germanic words last time when I looked for vocabulary that was becoming more overrepresented in fiction. And instead I saw subjectivity. So again, I don’t know.

The segment thing I can clarify. A segment can be either a whole volume, or — in the case of large volumes — it can be a chunk of between 50k and 100k words. (I split them up this way b/c I thought it would help with topic modeling.) It definitely is the case that the number of segments is smaller, overall, in the 19c than in the 18c, because books get bigger and there are fewer segments per sample. But I don’t know how to exclude this source of distortion w/out introducing other distortions. Plus, I believe it would affect all genres more or less equally.
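A sketch of the chunking logic, if it helps. (The sizes are as described above; the specific splitting rule here is my guess at a reasonable implementation, not a transcription of the actual code.)

```python
def split_volume(words, max_size=100_000):
    """Split a long volume into roughly equal chunks no larger than max_size.

    Short volumes stay whole; long ones are divided into equal-sized chunks,
    which keeps every chunk between 50k and 100k words."""
    n = len(words)
    if n <= max_size:
        return [words]
    n_chunks = -(-n // max_size)   # ceiling division
    size = -(-n // n_chunks)       # roughly equal chunk sizes
    return [words[i:i + size] for i in range(0, n, size)]
```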

Having roughly equal-sized (or even just smaller than 100K words) segments is good; one of the problems with my analysis, I think, is that some of the late 19C biographies and novels can be 800K words long, which induces some unpredictable swings. As long as 10 different books are making it in to the sample, I doubt this would be a big problem. (One minor tweak might be to throw out all but one segment for each book before constructing your comparison corpora, so you don’t end up comparing 3 corpora that each have 4 100K-word chunks of Gibbon in them).

Sorry to keep nitpicking the methodology like this–I’m envious you have a corpus that gives such clean results, else I’d do it more on my own.

No problem — I appreciate the suggestions, which have been very valuable.

I suspect there are even bigger issues I’ll have to deal with before this can come out on paper. Right now the 18c part of our collection comes mostly from TCP and the 19c mostly from our own selection. To my eyes, the results we’re getting are way too coherent and continuous to be explained by a break at 1800 … but I bet that’s the main source of skepticism I’ll have to address. (Aside from the sheer revulsion my more traditionally-minded colleagues are going to feel looking at all these graphs …)

Great posts, the both of you! Regarding the comments on smoothing, I think that letting the jitters in the data speak for themselves (using non-overlapping samples) is not the best way to go about this. It assumes that language use is a continuous function of time and further that your sample represents that continuous function (rather than coming from two completely different data sources, as Ted’s does), but if that isn’t the case, you’ll get “edge” effects which are artifacts of how you slice the data. In a sample this size those effects might not be big enough to matter, but I’d suggest offsetting your analysis by 10 years, or letting your samples overlap a few years at either end. Presumably the same trends will show, but it’s worth double-checking.

Non-overlapping time series trend analysis, in general, biases outliers in particular periods, especially if the sample is imperfect or if the data are somehow time-dependent.

Ah, good. I wasn’t really sure what the statistical authorities had to say about that point, but I like this way of viewing things. Especially since it produces prettier graphs!

Scott, I’m not sure I’m completely understanding this: “[non-overlapping samples] assumes that language use is a continuous function of time and further that your sample represents that continuous function (rather than coming from two completely different data sources, as Ted’s does), but if that isn’t the case, you’ll get “edge” effects which are artifacts of how you slice the data.” I would think if anything, it’s overlapping samples that assume continuity. Could you give an example of where edge effects will be bad? I primarily come across spurious edges from outliers when using overbroad moving windows, and I’m not sure I can imagine a bad one coming from non-overlapping samples. (I can imagine good ones, like a break in the data where source composition changes).

I’m interested in this b/c you’re right that there’s something very odd about assumptions of continuous change built into arraying time on a graph like this, in a way that cuts strongly against historical notions of time. It’s very interesting to talk about smooth transitions; but I want to see sampling methods that don’t automatically produce smooth contours regardless of the underlying data. For example, Ted, in the first graph of your first post on this there’s a steep steady drop in drama from 1780 to 1820, with relatively level stretches before and after; there’s no way to know if that’s actually a continuous change, or a single massive shift in 1800 smoothed out over the 40-year window. Non-overlapping samples with subsequent smoothing seem like the best way to escape this to me.

I’m interested in hearing Scott’s reply to the methodological question, which seems interestingly complex to me. But I will say that your example is a good one, Ben, and is exactly why I was persuaded to try non-overlapping samples. In this instance it did at first look very possible that there was a discrete shift around 1800 that just got “smoothed out.” And given my data sources that would have been a problem! On the other hand, I can imagine a situation where non-overlapping samples create an outlier that distorts the loess curve (loess smoothing can be pretty sensitive to outliers). Maybe there’s some practical middle ground. In my first set of graphs, the samples were overlapping by almost 90% of their width, and that’s probably too much. But I can imagine that it might be wise to permit a few years of overlap … ?

Let’s say, for example, that your data were generated from a periodic, sine-type function that peaks every 17 years. You have about 10 data points, averaged samples, over a period of 200 years, so one point every twenty years. Because you’re looking at different (non-synchronized) windows of a periodic function, some chunks will appear on average much higher than others, even though the mean of the whole function is effectively zero. The results you get wind up being an artifact of how the data are sliced, because the data themselves are time-dependent and your slicing didn’t take that into account. If you extend slices into neighboring windows, that effect is mitigated.
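A toy version of that effect, just for illustration (nothing to do with the actual corpus):

```python
import math

def window_means(period=17, start=1700, end=1900, width=20):
    """Average a sine wave with a `period`-year cycle inside consecutive
    non-overlapping `width`-year windows.

    Because the window width isn't a multiple of the period, the window
    means wander away from the function's true mean of zero, purely as an
    artifact of where the cuts fall."""
    means = []
    lo = start
    while lo + width <= end:
        vals = [math.sin(2 * math.pi * y / period) for y in range(lo, lo + width)]
        means.append(sum(vals) / width)
        lo += width
    return means
```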

That being said, the continuous/discontinuous issue is a tough one. If the underlying “ground-truth” of literature is continuous, but the dataset (Ted’s two databases) is discontinuous, overlapping samples will help mitigate the issue. That is, we can’t really tell if the steep change between the 18th and 19th century drama is because of the dataset change-over, or because of some underlying effect.

There’s a lot to tease apart here. The underlying “ground truth” may or may not be continuous; Ted’s available data may or may not be continuous; the data may or may not be a function of time. Assumptions of the answers of each of those questions (especially assumptions of whether or not massive events happen at the intersection of them, like a discontinuity around 1800 at the same time as the dataset shift) will determine which method is most appropriate.

When I said “assumes that language use is a continuous function of time and further that your sample represents that continuous function,” I was thinking that, *if* the underlying “ground-truth” were continuous but the dataset sample was not, non-overlapping samples would bias the dataset shift and possibly the time-slice criteria. If instead, as you say, the underlying data are discontinuous, we want to avoid pre-smoothing wherever possible.

I haven’t the foggiest idea what the solution is… possibly, if you run the analysis a few times, taking non-overlapping samples but offsetting the samples differently each time (starting in 1700, 1706, 1714, …), and you find that the overarching trends remain the same, that’d be a good start.
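Something like the following, just to make the idea concrete (a sketch of the slicing scheme, not working analysis code):

```python
def offset_slicings(start, end, width=20, offsets=(0, 6, 14)):
    """Build several alternative non-overlapping slicings of the same span,
    each shifted by a different offset.

    If the overarching trend survives every slicing, it probably isn't an
    artifact of where the cuts fall."""
    slicings = {}
    for off in offsets:
        bounds = []
        lo = start + off
        while lo + width - 1 <= end:
            bounds.append((lo, lo + width - 1))
            lo += width
        slicings[off] = bounds
    return slicings
```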

Also, we need some sort of group or list or society for algorithm-heavy humanities. Does this exist? These are invariably the most fun sorts of conversations I have over the course of any given week.

I know — I really enjoy chatting with you about Bayes, etc. As far as I know, the institutions associated with DH in general are the only “groups or lists or societies” that foster “algorithm-heavy” discussion. However, I certainly plan to keep reading your blog and Ben’s … and perhaps I should make a twitter list to make sure I don’t miss things you post …
