Categories
18c 19c collection-building genre comparison

Literary and nonliterary diction, the sequel.

In my last post, I suggested that literary and nonliterary diction seem to have substantially diverged over the course of the eighteenth and nineteenth centuries. The vocabulary of fiction, for instance, becomes less like nonfiction prose at the same time as it becomes more like poetry.

It’s impossible to interpret a comparative result like this purely as evidence about one side of the comparison. We’re looking at a process of differentiation that involves changes on both sides: the language of nonfiction and fiction, for instance, may both have specialized in different ways.

This post is partly a response to very helpful suggestions I received from commenters, both on this blog and at Language Log. It’s especially a response to Ben Schmidt’s effort to reproduce my results using the Bookworm dataset. I also try two new measures of similarity toward the end of the post (cosine similarity and etymology) which I think interestingly sharpen the original hypothesis.

I have improved my number-crunching in four main ways (you can skip these if you’re bored):

1) In order to normalize corpus size across time, I’m now comparing equal-sized samples. Because the sample sizes are small relative to the larger collection, I have been repeating the sampling process five times and averaging the results with a Fisher’s r-to-z transform. Repeated sampling doesn’t make a huge difference, but it slightly reduces noise. (A sketch of the averaging step appears after this list.)

2) My original blog post used 39-year slices of time that overlapped with each other, producing a smoothing effect. Ben Schmidt persuasively suggests that it would be better to use non-overlapping samples, so in this post I’m using non-overlapping 20-year slices of time.

3) I’m now running comparisons on the top 5,000 words in each pair of samples, rather than the top 5,000 words in the collection as a whole. This is a crucial and substantive change.

4) Instead of plotting a genre’s similarity to itself as a flat line of perfect similarity at the top of each plot, I plot self-similarity between two non-overlapping samples selected randomly from that genre. (Nick Lamb at Language Log recommended this approach.) This allows us to measure the internal homogeneity of a genre and use it as a control for the differentiation between genres.
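Here’s the promised sketch of the averaging step in (1), written in R. This isn’t the actual script, and the five correlations are invented; it just shows the Fisher transform at work.

# Average repeated Spearman correlations with a Fisher r-to-z transform:
# transform each r to z, average in z-space, then transform back to r.
average_correlations <- function(rhos) {
  z <- atanh(rhos)   # Fisher's r-to-z: 0.5 * log((1 + r) / (1 - r))
  tanh(mean(z))      # back-transform the mean
}

# Five invented correlations, one per repeated sample:
average_correlations(c(0.81, 0.78, 0.84, 0.80, 0.79))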

Briefly, I think the central claims I was making in my original post hold up. But the constraints imposed by this newly rigorous methodology have forced me to focus on nonfiction, fiction, and poetry. Our collections of biography and drama simply aren’t large enough yet to support equal-sized random samples across the whole period.

Here are the results for fiction compared to nonfiction, and nonfiction compared to itself.


This strongly supports the conclusion that fiction was becoming less like nonfiction, but also reveals that the internal homogeneity of the nonfiction corpus was decreasing, especially in the 18c. So some of the differentiation between fiction and nonfiction may be due to the internal diversification of nonfiction prose.

By contrast, here are the results for poetry compared to fiction, and fiction compared to itself.

Poetry and fiction are becoming more similar in the period 1720-1900. I should note that I’ve dropped the first data point, for the period 1700-1719, because it seemed to be an outlier. Also, we’re using a smaller sample size here, because my poetry collection won’t support million-word samples across the whole period. (We have stripped the prose introductions and notes from volumes of poetry, so they’re small.)

Another question that was raised, both by Ben and by Mark Liberman at Language Log, involved the relationship between “diction” and “topical content.” The Spearman correlation coefficient gives common and uncommon words equal weight, which means (in effect) that it makes no effort to distinguish style from content.

But there are other ways of contrasting diction. And I thought I might try them, because I wanted to figure out how much of the growing distance between fiction and nonfiction was due simply to the topical differentiation of nonfiction in this period. So in the next graph, I’m comparing the cosine similarity of million-word samples selected from fiction and nonfiction to distinct samples selected from nonfiction. Cosine similarity is a measure that, in effect, gives more weight to common words.
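For readers who haven’t met the measure, here’s a minimal sketch in R. The frequency vectors are invented, not drawn from my collection; the point is just that raw frequencies dominate the dot product, so common words like “the” carry most of the weight.

# Cosine similarity between two word-frequency vectors over a shared wordlist.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Invented counts for five words in a fiction sample and a nonfiction sample:
fiction    <- c(the = 60000, she = 9000, said = 7000, heart = 1200, population = 40)
nonfiction <- c(the = 65000, she = 1500, said = 1800, heart = 300, population = 2200)
cosine_similarity(fiction, nonfiction)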


I was surprised by this result. When I get very stable numbers for any variable I usually assume that something is broken. But I ran this twice, and used the same code to make different comparisons, and the upshot is that samples of nonfiction really are very similar to other samples of nonfiction in the same period (as measured by cosine similarity). I assume this is because the growing topical heterogeneity that becomes visible in Spearman’s correlation makes less difference to a measure that focuses on common words. Fiction is much more diverse internally by this measure — which makes sense, frankly, because the most common words can be totally different in first-person and third-person fiction. But — to return to the theme of this post — the key thing is that there’s a dramatic differentiation of fiction and nonfiction in this period.

Here, by contrast, are the results for nonfiction and poetry compared to fiction, as well as fiction compared to itself.

This graph is a little wriggly, and the underlying data points are pretty bouncy — because fiction is internally diverse when measured by cosine similarity, and it makes a rather bouncy reference point. But through all of that I think one key fact does emerge: by this measure, fiction looks more similar to nonfiction prose in the eighteenth century, and more similar to poetry in the nineteenth.

There’s a lot more to investigate here. In my original post I tried to identify some of the words that became more common in fiction as it became less like nonfiction. I’d like to run that again, in order to explain why fiction and poetry became more similar to each other. But I’ll save that for another day. I do want to offer one specific metric that might help us explain the differentiation of “literary” and “nonliterary” diction: the changing etymological character of the vocabulary in these genres.


Measuring the ratio of “pre-1150” to “post-1150” words is roughly like measuring the ratio of “Germanic” to “Latinate” diction, except that there are a number of pre-1150 words (like “school” and “wall”) that are technically “Latinate.” So this is essentially a way of measuring the relative “familiarity” or “informality” of a genre (Bar-Ilan and Berman 2007). (This graph is based on the top 10k words in the whole collection. I have excluded proper nouns, words that entered the language after 1699, and stopwords — determiners, pronouns, conjunctions, and prepositions.)
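Here’s a sketch of that measurement in R, assuming you already have a table of dates of entry into English. The dates and counts below are placeholders, not OED values.

# Ratio of pre-1150 to post-1150 tokens in a sample.
# 'entry_dates' maps words to (placeholder) dates of entry into English.
entry_dates <- c(heart = 900, dread = 1000, school = 1050,
                 courage = 1300, perception = 1400, population = 1580)

pre1150_ratio <- function(counts, dates) {
  words <- intersect(names(counts), names(dates))
  sum(counts[words][dates[words] < 1150]) /
    sum(counts[words][dates[words] >= 1150])
}

# Invented token counts for one genre sample:
sample_counts <- c(heart = 120, dread = 40, school = 15,
                   courage = 30, perception = 8, population = 2)
pre1150_ratio(sample_counts, entry_dates)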

I think this graph may help explain why we have the impression that literary language became less specialized in this period. It may indeed have become more informal — perhaps even closer to the spoken language. But in doing so it became more distinct from other kinds of writing.

I’d like to thank everyone who responded to the original post: I got a lot of good ideas for collection development as well as new ways of slicing the collection. Katherine Harris, for instance, has convinced me to add more women writers to the collection; I’m hoping that I can get texts from the Brown Women Writers Project. This may also be a good moment to reiterate that the nineteenth-century part of the collection I’m working with was selected by Jordan Sellers, and these results should be understood as built on his research. Finally, I have put the R code that I used for most of these plots on my Open Data page, but it’s ugly and not yet commented; prettier code will appear later this weekend.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

Categories
18c 19c genre comparison methodology

The differentiation of literary and nonliterary diction, 1700-1900.

When you stumble on an interesting problem, the question arises: do you blog the problem itself — or wait until you have a full solution to publish as an article?

In this case, I think the problem is too big to be solved by a single person anyway, so I might as well get it out there where we can all chip away at it. At the end of this post, I include a link to a page where you can also download the data and code I’m using.

When we compare groups of texts, we’re often interested in characterizing the contrast between them. But instead of characterizing the contrast, you could also just measure the distance between categories. For instance, you could generate a list of word frequencies for two genres, and then run a Spearman’s correlation test, to measure the rank-order similarity of their diction.
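In R, the test itself is one line once you have aligned frequency vectors for the two genres. The numbers below are invented; only the shape of the comparison matters.

# Rank-order similarity of diction between two genres, over a shared wordlist.
genre_a <- c(the = 52000, she = 6000, said = 4000, heart = 900, government = 150)
genre_b <- c(the = 54000, she = 800, said = 1500, heart = 200, government = 1900)

cor(genre_a, genre_b, method = "spearman")   # Spearman's rho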

In isolation, a measure of similarity between two genres is hard to interpret. But if you run the test repeatedly to compare genres at different points in time, the changes can tell you when the diction of the genres becomes more or less similar.

Spearman similarity to nonfiction, measured at 5-year intervals. At each interval, a 39-year chunk of the collection (19 years on either side of the midpoint) is being selected for comparison.

In the graph above, I’ve done that with four genres, in a collection of 3,724 eighteenth- and nineteenth-century volumes (constructed in part by TCP and in part by Jordan Sellers — see acknowledgments), using the 10,000 most frequent words in the collection, excluding proper nouns. The black line at the top is flat, because nonfiction is always similar to itself. But the other lines decline as poetry, drama, and fiction become progressively less similar to nonfiction where word choice is concerned. Unsurprisingly, prose fiction is always more similar to nonfiction than poetry is. But the steady decline in the similarity of all three genres to nonfiction is interesting. Literary histories of this period have tended to pivot on William Wordsworth’s rebellion against a specialized “poetic diction” — a story that would seem to suggest that the diction of 19c poetry should be less different from prose than 18c poetry had been. But that’s not the pattern we’re seeing here: instead it appears that a differentiation was setting in between literary and nonliterary language.

This should be described as a differentiation of “diction” rather than style. To separate style from content (for instance to determine authorship) you need to focus on the frequencies of common words. But when critics discuss “diction,” they’re equally interested, I think, in common and less common words — and that’s the kind of measure of similarity that Spearman’s correlation will give you (Kilgarriff 2001).

The graph above makes it look as though nonfiction was remaining constant while other genres drifted away from it. But we are, after all, graphing a comparison with two sides, which raises the question: were poetry, fiction, and drama changing relative to nonfiction, or was nonfiction changing relative to them? The answer, of course, is “both.”

At each 5-year interval, the Spearman similarity is being measured between the 40-year span surrounding that point and the period 1700-1740.

Here we’re comparing each genre to its own past. The language of nonfiction changes somewhat more rapidly than the language of the other genres, but none of them remain constant. There is no fixed reference point in this world, which is why I’m talking about the “differentiation” of two categories. But even granting that, we might want to pose another skeptical question: when literary genres become less like nonfiction, is that merely a sign of some instability in the definition of “nonfiction”? Did it happen mostly because, say, the nineteenth century started to publish on specialized scientific topics? We can address this question to some extent by selecting a more tightly defined subset of nonfiction as a reference point — say, biographies, letters, and orations.

The Spearman similarities here happen to be generated on the top 5,000 words rather than the top 10,000, but I have tried both wordsets and it makes very little difference.

Even when we focus on this relatively stable category, we see significant differentiation. Two final skeptical questions need addressing before I try to explain what happened. First, I’ve been graphing results so far as solid lines, because our eyes can’t sort out individual data points for four different variables at once. But a numerically savvy reader will want to see some distributions and error bars before assessing the significance of these results. So here are yearly values for fiction. In some cases these are individual works; when a single year contains two or more works of fiction, they have been summed and treated as a group. Each year of fiction is being compared against biographies, letters, and orations for 19 years on either side.

That’s a fairly persuasive trend. You may, however, notice that the Spearman similarities for individual years on this graph are about .1 lower than they were when we graphed fiction as a 39-year moving window. In principle Spearman similarity is independent of corpus size, but it can be affected by the diversity of a corpus. The similarity between two individual texts is generally going to be lower than the similarity between two large and diverse corpora. So could the changes we’ve seen be produced by changes in corpus size? There could be some effect, but I don’t think it’s large enough to explain the phenomenon. [See update at the bottom of this post. The results are in fact even clearer when you keep corpus size constant. -Ed.] The sizes of the corpora for different genres don’t change in a way that would produce the observed decreases in similarity; the fiction corpus, in particular, gets larger as it gets less like nonfiction. Meanwhile, it is becoming more like poetry. We’re dealing with some factor beyond corpus size.

So how then do we explain the differentiation of literary and nonliterary diction? As I started by saying, I don’t expect to provide a complete answer: I’m raising a question. But I can offer a few initial leads. In some ways it’s not surprising that novels would gradually become less like biographies and letters. The novel began very much as faked biography and faked correspondence. Over the course of the period 1700-1900 the novel developed a sharper generic identity, and one might expect it to develop a distinct diction. But the fact that poetry and drama seem to have experienced a similar shift (together with the fact that literary genres don’t seem to have diverged significantly from each other) begins to suggest that we’re looking at the emergence of a distinctively “literary” diction in this period.

To investigate the character of that diction, we need to compare the vocabulary of genres at many different points. If we just compared late-nineteenth-century fiction to late-nineteenth-century nonfiction, we would get the vocabulary that characterized fiction at that moment, but we wouldn’t know which aspects of it were really new. I’ve done that on the side here, using the Mann-Whitney rho test I described in an earlier post. As you’ll see, the words that distinguish fiction from nonfiction from 1850 to 1900 are essentially a list of pronouns and verbs used to describe personal interaction. But that is true to some extent about fiction in any period. We want to know what aspects of diction had changed.

In other words, we want to find the words that became overrepresented in fiction as fiction was becoming less like nonfiction prose. To find them, I compared fiction to nonfiction at five-year intervals between 1720 and 1880. At each interval I selected a 39-year slice of the collection and ranked words according to the extent to which they were consistently more prominent in fiction than nonfiction (using Mann-Whitney rho). After moving through the whole timeline you end up with a curve for each word that plots the degree to which it is over or under-represented in fiction over time. Then you sort the words to find ones that tend to become more common in fiction as the whole genre becomes less like nonfiction. (Technically, you’re looking for an inverse Pearson’s correlation, over time, between the Mann-Whitney rho for this word and the Spearman’s similarity between genres.) Here’s a list of the top 60 words you find when you do that:


It’s not hard to see that there are a lot of words for emotional conflict here (“horror, courage, confused, eager, anxious, despair, sorrow, dread, agony”). But I would say that emotion is just one aspect of a more general emphasis on subjectivity, ranging from verbs of perception (“listen, listened, watched, seemed, feel, felt”) to explicitly psychological vocabulary (“nerves, mind, unconscious, image, perception”) to questions about the accuracy of perception (“dream, real, sight, blind, forget, forgot, mystery, mistake”). To be sure, there are other kinds of words in the list (“cottage, boy, carriage”). But since we’re looking at a change across a period of 200 years, I’m actually rather stunned by the thematic coherence of the list. For good measure, here are words that became relatively less common in fiction (or more common in nonfiction — that’s the meaning of “relatively”) as the two genres differentiated:


Looking at that list, I’m willing to venture out on a limb and suggest that fiction was specializing in subjectivity while nonfiction was tending to view the world from an increasingly social perspective (“executive, population, colonists, department, european, settlers, number, individuals, average.”)
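To make the sorting step behind these lists concrete, here’s a toy version of the correlation I described above. Everything in it is fabricated: in the real version, rho_matrix would have a row for every word and a column for every slice of the timeline.

# For each word, take the Pearson correlation, over time, between the word's
# prominence in fiction (Mann-Whitney rho) and the fiction/nonfiction
# Spearman similarity. Strongly negative values flag words that rose in
# fiction as the genres diverged.
set.seed(1)
intervals  <- 33   # five-year steps, 1720-1880
similarity <- seq(0.9, 0.6, length.out = intervals) + rnorm(intervals, sd = 0.01)
rho_matrix <- rbind(
  horror  = seq(0.4, 0.8, length.out = intervals) + rnorm(intervals, sd = 0.02),
  cottage = seq(0.5, 0.7, length.out = intervals) + rnorm(intervals, sd = 0.02),
  average = seq(0.7, 0.4, length.out = intervals) + rnorm(intervals, sd = 0.02)
)
sort(apply(rho_matrix, 1, cor, y = similarity))   # most negative first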

Now, I don’t pretend to have solved this whole problem. First of all, the lists I just presented are based on fiction; I haven’t yet assessed whether there’s really a shared “literary diction” that unites fiction with poetry and drama. Jordan and I probably need to build up our collection a bit before we’ll know. Also, the technique I just used to select lists of words looks for correlations across the whole period 1700-1900, so it’s going to select words that have a relatively continuous pattern of change throughout this period. But it’s also entirely possible that “the differentiation of literary and nonliterary diction” was a phenomenon composed of several different, overlapping changes with a smaller “wavelength” on the time axis. So I would say that there’s lots of room here for alternate/additional explanations.

But really, this is a question that does need explanation. Literary scholars may hate the idea of “counting words,” but arguments about a distinctively “literary” language have been central to literary criticism from John Dryden to the Russian Formalists. If we can historicize that phenomenon — if we can show that a systematic distinction between literary and nonliterary language emerged at a particular moment for particular reasons — it’s a result that ought to have significance even for literary scholars who don’t consider themselves digital humanists.

By the way, I think I do know why the results I’m presenting here don’t line up with our received impression that “poetic diction” is an eighteenth-century phenomenon that fades in the 19c. There is a two-part answer. For one thing, part of what we perceive as poetic diction in the 18c is orthography (“o’er”, “silv’ry”). In this collection, I have deliberately normalized orthography, so “silv’ry” is treated as equivalent to “silvery,” and that aspect of “poetic diction” is factored out.

But we may also miss differentiation because we wrongly assume that plain or vivid language cannot be itself a form of specialization. Poetic diction probably did become more accessible in the 19c than it had been in the 18c. But this isn’t the same thing as saying that it became less specialized! A self-consciously plain or restricted diction still counts as a mode of specialization relative to other written genres. More on this in a week or two …

Finally, let me acknowledge that the work I’m doing here is built on a collaborative foundation. Laura Mandell helped me obtain the TCP-ECCO volumes before they were public, and Jordan Sellers selected most of the nineteenth-century collection on which this work is based — something over 1,600 volumes. While Jordan and I were building this collection, we were also in conversation with Loretta Auvil, Boris Capitanu, Tanya Clement, Ryan Heuser, Matt Jockers, Long Le-Khac, Ben Schmidt, and John Unsworth, and were learning from them how to do this whole “text mining” thing. The R/MySQL infrastructure for this is pretty directly modeled on Ben’s. Also, since the work was built on a collaborative foundation, I’m going to try to give back by sharing links to my data and code on this “Open Data” page.

References
Adam Kilgarriff, “Comparing Corpora,” International Journal of Corpus Linguistics 6.1 (2001): 97-133.

[UPDATE Monday Feb 27th, 7 pm: After reading Ben Schmidt’s comment below, I realized that I really had to normalize corpus size. “Probably not a problem” wasn’t going to cut it. So I wrote a script that samples a million-word corpus for each genre every two years. As long as I was addressing that problem, I figured I would address another one that had been nagging at my conscience. I really ought to be comparing a different wordlist each time I run the comparison. It ought to be the top 5,000 words in each pair of corpora that get compared — not the top 5,000 words in the collection as a whole.

The first time I ran the improved version I got a cloud of meaningless dots, and for a moment I thought my whole hypothesis about genre had been produced by a ‘loose optical cable.’ Not a good moment. But it was a simple error, and once I fixed it I got results that were actually much clearer than my earlier graphs.

I suppose you could argue that, since document size varies across time, it’s better to select corpora that have a fixed number of documents rather than a fixed word size. I ran the script that way too, and it produces results that are noisier but still unambiguous. The moral of the story is: it’s good to have blog readers who keep you honest and force you to clean up your methodology!]

Categories
teaching

Syllabus: ENGL581: Digital Tools and Critical Theory.

This syllabus is indebted to just about everyone who has posted a syllabus for a DH course, and especially to Paul Fyfe, from whose draft syllabus I borrowed several readings.

The syllabus itself is here as a .pdf file.

As you’ll see if you download it, this is not a general digital humanities course. At Urbana-Champaign, John Unsworth has been teaching an introduction to digital humanities in the Graduate School of Library and Information Science, and there’s no way I could hope to replicate his breadth of knowledge. Instead I’ve focused on literary and historical applications of text mining, because that’s an area where I feel I can teach skills that a wide range of humanities graduate students will find immediately useful.

I realize the choice of focus may seem odd, since text mining is a relatively controversial subfield of DH, and a technically challenging one. There’s no way to duck the technical challenge: I am going to try to teach enough coding (using R) to empower students to define their own questions and visualize their own results. But I don’t think controversies about quantification need to be a problem, since I approach text mining largely as a discovery strategy. I hope it will turn up insights and clues that students find useful, without necessarily compelling them to add a lot of numbers or graphs to their arguments.

The “tools” and “theory” in the title of the course are not meant to be pitted against each other. The title instead flags a working assumption that practice and theory are fused: our interpretive theories are already shaped by the social/technical infrastructure we use to find and read texts, so reflectively reshaping that infrastructure is a way of “doing theory.”

Categories
interpretive theory statistics

Do humanists get their ideas from anything at all?

My reaction to Stanley Fish’s third column on digital humanities was at first so negative that I thought it not worth writing about. But in the light of morning, there is something here worth discussing. Fish raises a neglected issue that I (and a bunch of other people cited at the end of this post) have been trying to foreground: the role of discovery in the humanities. He raises the issue symptomatically, by suppressing it, but the problem is too important to let that slide.

Fish argues, in essence, that digital humanists let the data suggest hypotheses for them instead of framing hypotheses that are then tested against evidence.

The usual way of doing this is illustrated by my example: I began with a substantive interpretive proposition … and, within the guiding light, indeed searchlight, of that proposition I noticed a pattern that could, I thought, be correlated with it. I then elaborated the correlation.

The direction of my inferences is critical: first the interpretive hypothesis and then the formal pattern, which attains the status of noticeability only because an interpretation already in place is picking it out.

The direction is the reverse in the digital humanities: first you run the numbers, and then you see if they prompt an interpretive hypothesis. The method, if it can be called that, is dictated by the capability of the tool.

The underlying element of truth here is that all researchers — humanists and scientists alike — do need to separate the process of discovering a hypothesis from the process of testing it. Otherwise you run into what we unreflecting empiricists call “the problem of data dredging.” If you simply sweep a net through an ocean of data, and frame a conclusion based on whatever you catch, you’re not properly testing anything, because you’re implicitly testing an infinite number of hypotheses that are left unstated — and the significance of any single test is reduced when it’s run as part of a large battery.
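The problem is easy to demonstrate. Here’s a two-line simulation in R (nobody’s actual analysis): sweep a battery of tests through pure noise, and “significant” results surface anyway.

# Dredge 1,000 comparisons where the null hypothesis is true by construction.
set.seed(42)
p_values <- replicate(1000, t.test(rnorm(30), rnorm(30))$p.value)
sum(p_values < 0.05)   # roughly 50 "discoveries," every one of them spurious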

That’s true, but it’s also a problem that people who do data mining are quite self-conscious about. It’s why I never stop linking to this xkcd comic about “significance.” And it’s why Matt Wilkens (mistargeted by Fish as an emblem of this interpretive sin) goes through a deliberately iterative process of first framing hypotheses about nineteenth-century geographical imagination and then testing them. (For instance, after noticing that certain states seem especially prominent in 19c American fiction, he tests whether this remains true after you compensate for differences in population size, and then proposes a pair of hypotheses that he suggests will need to be evaluated against additional “test cases.”)

William Blake, "Satan, Sin, and Death"

More importantly, Fish profoundly misrepresents his own (traditional) interpretive procedure by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account we normally begin with a hypothesis (which seems to have sprung, like Sin, fully-formed from our head), and test it against a single sentence.

In reality, of course, our “interpretive proposition” is often suggested by the same evidence that confirms it. Or — more commonly — we derive a hypothesis from one example, and then read patiently through dozens of books until we have gathered enough confirming evidence to write a chapter. This process runs into a different interpretive fallacy: if you keep testing a hypothesis until you’ve confirmed it, you’re not testing it at all. And it’s a bit worse than that, because in practice what we do now is go to a full-text search engine and search for terms that would go together if our assumptions were correct. (In the example Fish offers, this might be “bishops” and “presbyters.”) If you find three sentences where those terms coincide, you’ve got more than enough evidence to prop up an argument, using our richly humanistic (cough, anecdotal) conception of evidence. And of course a full-text search engine can find you three examples of just about anything. But we don’t have to worry about this, because search engines are not tools that dictate a method; they are transparent extensions of our interpretive sensibility.

The basic mistake that Fish is making is this: he pretends that humanists have no discovery process at all. For Fish, the interpretive act is always fully contained in an encounter with a single piece of evidence. How your “interpretive proposition” got framed in the first place is a matter of no consequence: some readers are just fortunate to have propositions that turn out to be correct. Fish is not alone in this idealized model of interpretation; it’s widespread among humanists.

Fish is resisting the assistance of digital techniques, not because they would impose scientism on the humanities, but because they would force us to acknowledge that our ideas do after all come from somewhere — whether a search engine or a commonplace book. But as Peter Stallybrass eloquently argued five years ago in PMLA (h/t Mark Sample) the process of discovery has always been collaborative, and has long — at least since early modernity — been embodied in specific textual technologies.

References
Stallybrass, Peter. “Against Thinking.” PMLA 122.5 (2007): 1580-1587.
Wilkens, Matthew. “Geolocation Extraction and Mapping of Nineteenth-Century U.S. Fiction.” DHCS 2011.
On the process of embodied play that generates ideas, see also Stephen Ramsay’s book Reading Machines (University of Illinois Press, 2011).

Categories
impressionistic criticism

Fish wins round two.

This barely deserves to be a blog post, but I can’t resist a brief critical appreciation of Stanley Fish’s second column on the digital humanities.

Fish argues that digital humanists’ insistence on the networked character of human communication (or even human identity) makes them a) postmodern, b) theological, in the sense that they’re promising a transcendence of individual mortality, and c) political in an explicitly leftist way. In making these points, he cites about 2.5% of the people in my Twitter stream, which is one reason why I like the column.

The cover of Neuromancer that I remember. I may not have copyright to this image, but file-sharing is part of my religion.

The main reason I like it, though, is that it raises the bar for stylistic slipperiness in the pages of the NYT. Fish begins the column by posing as someone with a firm belief in the stability of the text, and in authorial identity. He says that he believes in these as strongly, in fact, as the critic Morris Zapp. This is pretty delicious, given that Zapp is a fictional character notoriously modeled on Stanley Fish. He can hardly function as an emblem of stable authorial identity … though he might well emblematize the immortal alter-ego that writing has always made possible. I’m reminded of the “laugh that wasn’t laughter” at the end of Neuromancer.

Which brings me to the only place in the column where I do feel dissed. Fish thinks humanists promoting DH will be shocked by the notion that enthusiasm for the web involves a religious transcendence of mortality. Come on — we’ve read @GreatDismal. Moreover, a lot of us have read Emile Durkheim on the religious character of all social feeling, or Carl Becker on the Enlightenment’s secular faith in posterity. Just about all forms of reflection on history and writing promise a transcendence of individual identity.

What’s more fun are the cases where they become religions in a socially concrete way — like the Swedish church of Kopimism, brought to my attention by James Dabbs, which makes the act of file-sharing its central sacrament.

I enjoyed this column so much that I’m hoping the third installment (about digital analysis of “aesthetic works”) will be equally thoughtful and slippery. I’m rooting for Fish to resist the magnetic pull of formulations like “computers will never …” and “merely counting words can never ….” But those binary assumptions are hard to resist: I’m going to be wracked with suspense.

UPDATE Jan 23rd. This really isn’t worth a blog post. But I should just briefly register my disappointment in Fish’s third column. It’s sophistry, and not even sophistry of an interesting kind. Once you say “excluded middle fallacy founded on willful misreading of two examples,” you’ve pretty much done all that needs to be done with it. Too bad.

Categories
18c fiction methodology

MLA talk: just the thesis.

Giving a talk this morning at the MLA. There are two main arguments:

1) The first one will be familiar if you’ve read my blog. I suggest that the boundary between “text mining” and conventional literary research is far fuzzier than people realize. There appears to be a boundary only because literary scholars are pretty unreflective about the way we’re currently using full-text search. I’m going to press this point in detail, because it’s not just a metaphor: to produce a simple but useful topic-modeling algorithm, all you have to do is take a search engine and run it backwards.

2) The second argument is newer; I don’t think I’ve blogged about it yet. I’m going to present topic modeling as a useful bridge between “distant” and “close” reading. I’ve found that I often learn most about a genre by modeling it as part of a larger collection that includes many other genres. In that context, a topic-modeling algorithm can highlight peculiar convergences of themes that characterize the genre relative to its contemporary backdrop.

a slide from the talk, where a simple topic-modeling algorithm has been used to produce a dendrogram that offers a clue about the temporal framing of narration in late-18c novels

This is distant reading, in the sense that it requires a large collection. But it’s also close reading, in the sense that it’s designed to reveal subtle formal principles that shape individual works, and that might otherwise elude us.

Although the emphasis is different, a lot of the examples I use are recycled from a talk I gave in August, described here.

Categories
math methodology statistics undigitized humanities

A brief outburst about numbers.

In responding to Stanley Fish last week, I tried to acknowledge that the “digital humanities,” in spite of their name, are not centrally about numbers. The movement is very broad, and at the broadest level, it probably has more to do with networked communication than it does with quantitative analysis.

The older tradition of “humanities computing” — which was about numbers — has been absorbed into this larger movement. But it’s definitely the part of DH that humanists are least comfortable with, and it often has to apologize for itself. So, for instance, I’ve spent much of the last year reminding humanists that they’re already using quantitative text mining in the form of search engines — so it can’t be that scary.* Kathleen Fitzpatrick recently wrote a post suggesting that “one key role for a ‘worldly’ digital humanities may well be helping to break contemporary US culture of its unthinking association of numbers with verifiable reality….” Stephen Ramsay’s Reading Machines manages to call for an “algorithmic criticism” while at the same time suggesting that humanists will use numbers in ways that are altogether different from the way scientists use them (or at least different from “scientism,” an admittedly ambiguous term).

I think all three of us (Stephen, Kathleen, and myself) are making strategically necessary moves. Because if you tell humanists that we do (also) need to use numbers the way scientists use them, your colleagues are going to mutter about naïve quests for certainty, shake their heads, and stop listening. So digital humanists are rhetorically required to construct positivist scapegoats who get hypothetically chased from our villages before we can tell people about the exciting new kinds of analysis that are becoming possible. And, to be clear, I think the people I’ve cited (including me) are doing that in fair and responsible ways.

However, I’m in an “eppur si muove” mood this morning, so I’m going to forget strategy for a second and call things the way I see them. <Begin Galilean outburst>

In reality, scientists are not naïve about the relationship between numbers and certainty, because they spend a lot of time thinking about statistics. Statistics is the science of uncertainty, and it insists — as forcefully as any literary theorist could — that every claim comes accompanied by a specific kind of ignorance. Once you accept that, you can stop looking for absolute knowledge, and instead reason concretely about your own relative uncertainty in a given instance. I think humanists’ unfamiliarity with this idea may explain why our critiques of data mining so often take the form of pointing to a small error buried somewhere in the data: unfamiliarity with statistics forces us to fall back on a black-and-white model of truth, where the introduction of any uncertainty vitiates everything.

Moreover, the branch of statistics most relevant to text mining (Bayesian inference) is amazingly, almost bizarrely willing to incorporate subjective belief into its definition of knowledge. It insists that definitions of probability have to depend not only on observed evidence, but on the “prior probabilities” we held before we saw the evidence. If humanists were more familiar with Bayesian statistics, I think it would blow a lot of minds.
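A toy example shows the mechanics (the numbers are invented): the same piece of evidence moves observers with different priors to very different conclusions.

# Bayes' rule: posterior belief depends on the prior as well as the evidence.
posterior <- function(prior, p_evidence_if_true, p_evidence_if_false) {
  prior * p_evidence_if_true /
    (prior * p_evidence_if_true + (1 - prior) * p_evidence_if_false)
}

# Evidence that is four times likelier if the hypothesis is true:
posterior(0.5, 0.8, 0.2)   # an open mind ends up at 0.80
posterior(0.1, 0.8, 0.2)   # a skeptic ends up around 0.31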

I know the line about “lies, damn lies, and so on,” and it’s certainly true that statistics can be abused, as this classic xkcd comic shows. But everything can be abused. The remedy for bad verbal argument is not to “remember that speech should stay in its proper sphere” — it’s to speak better and more critically. Similarly, the remedy for bad quantitative argument is not “remember that numbers have to stay in their proper sphere”; it’s to learn statistics and reason more critically.

possible shapes of the Beta distribution, from Wikipedia

None of this is to say that we can simply borrow tools or methods from scientists unchanged. The humanities have a lot to add — especially when it comes to the social and historical character of human behavior. I think there are fascinating advances taking place in data science right now. But when you take apart the analytic tools that computer scientists have designed, you often find that they’re based on specific mistaken assumptions about the social character of language. For instance, there’s a method called “Topics over Time” that I want to use to identify trends in the written record (Wang and McCallum, 2006). The people who designed it have done really impressive work. But if a humanist takes apart the algorithm underlying this method, they will find that it assumes that every trend can be characterized as a smooth curve called a “Beta distribution.” Whereas in fact, humanists have evidence that the historical trajectory of a topic is often more complex than that, in ways that really matter. So before I can use this tool, I’m going to have to fix that part of the method.
The diachronic behavior a topic can actually exhibit.
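To see the objection concretely, here’s a minimal sketch in R with invented curves. A single Beta distribution is unimodal (or U-shaped), so it cannot represent a topic that rises, falls, and rises again.

# Rescale time to the unit interval and compare trajectories.
t <- seq(0.01, 0.99, by = 0.01)
beta_trend    <- dbeta(t, shape1 = 2, shape2 = 5)              # one smooth peak
bimodal_trend <- 0.6 * dbeta(t, 8, 2) + 0.4 * dbeta(t, 2, 8)   # two peaks

plot(t, bimodal_trend, type = "l", xlab = "time", ylab = "topic prominence")
lines(t, beta_trend, lty = 2)   # dashed: the best a single Beta can do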

But this is a problem that can be fixed, in large part, by fixing the numbers. Humanists have a real contribution to make to the science of data mining, but it’s a contribution that can be embodied in specific analytic insights: it’s not just to hover over the field like the ghost of Ben Kenobi and warn it about hubris.

</Galilean outburst>

For related thoughts, somewhat more temperate than the outburst above, see this excellent comment by Matthew Wilkens, responding to a critique of his work by Jeremy Rosen.

* I credit Ben Schmidt for this insight so often that regular readers are probably bored. But for the record: it comes from him.

Categories
methodology undigitized humanities

Why digital humanities isn’t actually “the next thing in literary studies.”

Tube modules in an IBM mainframe, from the RCS/RI photo collection. Photo by Gary Stevens of HostingCanada, https://hostingcanada.org

It’s flattering for digital humanists to be interpellated by Stanley Fish as the next thing in literary studies. It’s especially pleasant since the field is old enough now to be tickled by depiction as a recent fad — as Fish must know, since he tangled with an earlier version of it (“humanities computing”) in the 80s.

Fish seems less suspicious of computing these days, and he understands the current contours of digital humanities well. As he implies, DH is not a specific method or theory, but something more like a social movement that extends messily from “the refining of search engines” to “the rethinking of peer review.”

In short, Fish’s column is kind enough. But I want to warn digital humanists about the implications of his flattery. Literary scholars are addicted to a specific kind of methodological conflict. Fish is offering an invitation to consider ourselves worthy of joining the fight. Let’s not.

The outlines of the debate I have in mind emerge at the end of this column as Fish sets up his next one. It turns out that the discipline of literary studies is in trouble! Maybe enrollments are down, or literary culture is in peril; as Fish himself hints, this script is so familiar that we hardly need to spell out the threat. Anyway, the digital humanities have implicitly promised that their new version of the discipline will ensure “the health and survival of the profession.” But can they really do so? Tune in next week …

Or don’t. As flattering as it is to be cast in this drama, digital humanists would be better advised to bow out. The disciplinary struggle that Fish wants to stage around us is not our fight, and was perhaps never a very productive fight anyway.

In explaining why I feel this way, I’m going to try to address both colleagues who “do” DH and those who are apprehensive about it. I think it’s fair to be apprehensive, but the apprehension I’m hearing these days (from Fish and from my own friends) seems to me too narrowly targeted. DH is not the kind of trend humanists are used to, which starts with a specific methodological insight and promises to revive a discipline (or two) by generalizing that insight. It’s something more diffuse, and the diffuseness matters.

1. Why isn’t digital humanities yet another answer to the question “How should we save literary studies?” First of all, because digital humanities is not a movement within literary studies. It includes historians and linguists, computer scientists and librarians.

“Interdisciplinary?” Maybe, but extra-disciplinary might be a better word, because DH is not even restricted to the ranks of faculty. When I say “librarians,” I mean not only faculty in library schools, but people with professional appointments in libraries. Academic professionals have often been the leading figures in this field.

So DH is really not another movement to revitalize literary studies by making it relevant to [X]. There are people who would like to cast it in those terms. Doing so would make it possible to stage a familiar sort of specifically disciplinary debate. It would also, incidentally, allow the energy of the field to be repossessed by faculty, who have historically been in charge of theoretical debate, but not quite so securely in charge of (say) collaborations to build new infrastructure. [I owe this observation to Twitter conversation with Bethany Nowviskie and Miriam Posner.]

But reframing digital humanities in that way would obscure what’s actually interesting and new about this moment — new opportunities for collaboration both across disciplines and across the boundary between the conceptual work of academia and the infrastructure that supports and tacitly shapes it.

2. That sounds very nice, but isn’t there still an implicit disciplinary argument — and isn’t that the part of this that matters?

I understand the suspicion. In literary studies, change has almost always taken place through a normative claim about the proper boundaries of the discipline. Always historicize! Or on second thought no, don’t historicize, but instead revive literary culture by returning to our core competence of close reading!

But in my experience digital humanists are really not interested in regulating disciplinary boundaries — except insofar as they want a seat at the table. “Isn’t DH about turning the humanities into distant reading and cliometrics and so on?” I understand the suspicion, but no. I personally happen to be enthusiastic about distant reading, but DH is more diverse than that. Digital humanists approach interpretation in a lot of different ways, at different scales. Some people focus tightly on exploration of a single work. “But isn’t it in any case about displacing interpretation with a claim to empirical truth?” Absolutely not. Here I can fortunately recommend Stephen Ramsay’s recent book Reading Machines, which understands algorithms as ways of systematically deforming a text in order to enhance interpretive play. Ramsay is quite eloquent about the dangers of “scientism.”

The fundamental mistake here may be the assumption that quantitative methods are a new thing in the humanities, and therefore must imply some new and terrifyingly normative positivism. They aren’t new. All of us have been using quantitative tools for several decades — and using them to achieve a wide variety of theoretical ends. The only thing that’s new in the last few years is that humanists are consciously taking charge of the tools ourselves. But I’ve said a lot about that in the past, so I’ll just link to my previous discussion.

3. Well, shouldn’t DH be promising to save literary studies, or the humanities as a whole? Isn’t it irresponsible to ignore the present crisis in academia?

Digital humanists haven’t ignored the social problems of academia; on the contrary, as Fish acknowledges, they’re engaging those problems at multiple levels. Rethinking peer review and scholarly publishing, for instance. Or addressing the tattered moral logic of graduate education by trying to open alternate career paths for humanists. Whatever it means to “do digital humanities,” it has to imply thinking about academia as a social institution.

But it doesn’t have to imply the mode of social engagement that humanists have often favored — which is to make normative claims about the boundaries of our own disciplines, with the notion that in doing so we are defending some larger ideal. That’s not a part of the job we should feel guilty about skipping.

4. Haven’t you defined “digital humanities” so broadly that it’s impossible to make a coherent argument for or against it?

I have, and that might be a good thing. I sometimes call DH a “field” because I lack a better word, but digital humanities is not a discipline or a coherent project. It’s a rubric under which a bunch of different projects have gathered — from new media studies to text mining to the open-access movement — linked mainly by the fact that they are responding to related kinds of fluidity: rapid changes in representation, communication, and analysis that open up detours around some familiar institutions.

It’s hard to be “for” or “against” a set of developments like this — just as it was hard to be for or against all types of “theory” at the same time. Of course, the emptiness of a generally pro- or anti-theory position never stopped us! Literary scholars are going to want to take a position on DH, as if it were a familiar sort of polemical project. But I think DH is something more interesting than that — intellectually less coherent, but posing a more genuine challenge to our assumptions.

I suppose, if pressed, I would say “digital humanities” is the name of an opportunity. Technological change has made some of the embodiments of humanistic work — media, archives, institutions, perhaps curricula — a lot more plastic than they used to be. That could turn out to be a good thing or a bad thing. But it’s neither of those just yet: the meaning of the opportunity is going to depend on what we make of it.

Categories
methodology undigitized humanities

What no one tells you about the digital humanities.

There are already several great posts out there that exhaustively list resources and starting points for people getting into DH (a lot of them are by Lisa Spiro, who is good at it).

Opportunities are not always well signposted.

This will be a shorter list. I’m still new enough at this to remember what surprised me in the early going, and there were two areas where my previous experience in the academy failed to prepare me for the fluid nature of this field.

1) I had no idea, going into this, just how active a scholarly field could be online. Things are changing rapidly — copyright lawsuits, new tools, new ideas. To find out what’s happening, I think it’s actually vital to lurk on Twitter. Before I got on Twitter, I was flying blind, and didn’t even realize it. Start by following Brett Bobley, head of the Office of Digital Humanities at the NEH. Then follow everyone else.

2) The technical aspect of the field is important — too important, in many cases, to be delegated. You need to get your hands dirty. But the technical aspect is also much less of an obstacle than I originally assumed. There’s an amazing amount of information on the web, and you can teach yourself to do almost anything in a couple of weekends.* Realizing that you can is half the battle. For a pep talk / inspiring example, try this great narrative by Tim Sherratt.

That’s it. If you want more information, see the links to Lisa Spiro and DiRT at the top of this post. Lisa is right, by the way, that the place to start is with a particular problem you want to solve. Don’t dutifully acquire skills that you think you’re supposed to have for later use. Just go solve that problem!

* ps: Technical obstacles are minor even if you want to work with “big data.” We’re at a point now where you can harvest your own big data — big, at least, by humanistic standards. Hardware limitations are not quite irrelevant, but you won’t hit them for the first year or so, though you may listen anxiously while that drive grinds much more than you’re used to …

Categories
18c 19c math methodology ngrams

Exploring the relationship between topics and trends.

I’ve been talking about correlation since I started this blog. Actually, that was the reason why I did start it: I think literary scholars can get a huge amount of heuristic leverage out of the fact that thematically and socially-related words tend to rise and fall together. It’s a simple observation, and one that stares you in the face as soon as you start to graph word frequencies on the time axis.1 But it happens to be useful for literary historians, because it tends to uncover topics that also pose periodizable kinds of puzzles. Sometimes the puzzle takes the form of a topic we intuitively recognize (say, the concept of “color”) that increases or decreases in prominence for reasons that remain to be explained:

At other times, the connection between elements of the topic is not immediately intuitive, but the terms are related closely enough that their correlation suggests a pattern worthy of further exploration. The relationship between terms may be broadly historical:

Or it may involve a pattern of expression that characterizes a periodizable style:

Of course, as the semantic relationship between terms becomes less intuitively obvious, scholars are going to wonder whether they’re looking at a real connection or merely an accidental correlation. “Ardent” and “tranquil” seem like opposites; can they really be related as elements of a single discourse? And what’s the relationship to “bosom,” anyway?

Ultimately, questions like this have to be addressed on a case-by-case basis; the significance of the lead has to be fleshed out both with further analysis, and with close reading.

But scholars who are wondering about the heuristic value of correlation may be reassured to know that this sort of lead does generally tend to pan out. Words that correlate with each other across the time axis do in practice tend to appear in the same kinds of volumes. For instance, if you randomly select pairs of words from the top 10,000 words in the Google English ngrams dataset 1700-1849,2 measure their correlation with each other in that dataset across the period 1700-1849, and then measure their tendency to appear in the same volumes in a different collection3 (taking the cosine similarity of term vectors in a term-document matrix), the different measures of association correlate with each other strongly. (Pearson’s r is 0.265, significant at p < 0.0005.) Moreover, the relationship holds (less strongly, but still significantly) even in adjacent centuries: words that appear in the same eighteenth-century volumes still tend to rise and fall together in the nineteenth century.
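Schematically, the test looks like this in R. The two vectors of measurements are invented here; in the real calculation each element corresponds to one randomly selected word pair.

# For each word pair: (a) correlation of the two words' yearly frequencies in
# the ngrams dataset, and (b) cosine similarity of their term vectors in a
# term-document matrix. Then ask whether measure (a) predicts measure (b).
set.seed(7)
n_pairs      <- 200
diachronic_r <- runif(n_pairs, -0.5, 0.9)                      # measure (a), invented
topical_sim  <- 0.3 * diachronic_r + rnorm(n_pairs, sd = 0.2)  # measure (b), invented

cor.test(diachronic_r, topical_sim)   # Pearson's r between the two measures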

Why should humanists care about the statistical relationship between two measures of association? It means that correlation-mining is in general going to be a useful way of identifying periodizable discourses. If you find a group of words that correlate with each other strongly, and that seem related at first glance, it’s probably worth following up the hunch: you’re likely looking at a discourse that is bound together both diachronically (in the sense that the terms rise and fall together) and topically (in the sense that they tend to appear in the same kinds of volumes).

Ultimately, literary historians are going to want to assess correlation within different genres; a dataset like Google's, which mixes all genres in a single pool, is not going to be an ideal tool. However, this is also a domain where size matters, and in that respect, at the moment, the ngrams dataset is very helpful. It becomes even more helpful if you correct some of the errors that vitiate it in the period before 1820. A team of researchers at Illinois and Stanford4, supported by the Andrew W. Mellon Foundation, has been doing that over the course of the last year, and we're now able to make an early version of the tool available on the web. Right now, this ngram viewer only covers the period 1700-1899, but we hope it will be useful for researchers in that period, because it has mostly corrected the long-s problem that confufes opt1cal charader readers in the 18c — as well as a host of other, less notorious problems. Moreover, it allows researchers to mine correlations in the top 10,000 words of the lexicon, instead of trying words one by one to see whether an interesting pattern emerges. In the near future, we hope to expand the correlation miner to cover the twentieth century as well.

For further discussion of the statistical relationship between topics and trends, see this paper submitted to DHCS 2011.

UPDATE Nov 22, 2011: At DHCS 2011, Travis Brown pointed out to me that Topics Over Time (Wang and McCallum) might mine very similar patterns in a more elegant, generative way. I hope to find a way to test that method, and may perhaps try to build an implementation for it myself.

References
1) Ryan Heuser and I both noticed this pattern last winter. Ryan and Long Le-Khac presented on a related topic at DH2011: Heuser, Ryan, and Le-Khac, Long. “Abstract Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field,” Digital Humanities 2011, Stanford University.

2) Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (Published online ahead of print: 12/16/2010)

3) The collection of 3,134 documents (1700-1849) I used for this calculation was produced by combining ECCO-TCP volumes with nineteenth-century volumes selected and digitized by Jordan Sellers.

4) The SEASR Correlation Analysis and Ngrams Viewer was developed by Loretta Auvil and Boris Capitanu at the Illinois Informatics Institute, modeled on prototypes built by Ted Underwood, University of Illinois, and Ryan Heuser, Stanford.