The Stone and the Shell

Getting everything you want from HathiTrust.

Post author By tedunderwood
Post date July 27, 2012
7 Comments on Getting everything you want from HathiTrust.

justshelves — NYPL, photo courtesy Alex Proimos

A fair number of scholars would like to work on large digital collections, but aren’t entirely sure where to get them.

For people who work on text after, say, 1700, I’d like to briefly make a case for HathiTrust. I’m a few months into a project based on 800,000 volumes — collaborating with Mike Black, an English Ph.D student and extraordinary Python programmer. We decided to get our collection from HathiTrust, and it’s a decision I haven’t regretted. In terms of sheer numbers, I don’t know whether they’re larger than, say, the Internet Archive. But their collection has some subtle details that I’ve come to greatly appreciate.

For one thing, they divide documents into individual page files. At first this may seem like a pain (you want a file, right, not a folder of files?) But in fact it’s a significant advantage to have that hard-coded representation of page breaks. It has made it possible for Mike to design a Python script that a) recognizes running headers at the tops of pages b) uses them to make a reasonable guess about chapters and other document divisions and then c) removes the headers, which can otherwise throw a wrench in your topic model.

Also, the HathiTrust API is solid and well documented. If you request a large dataset from them, you will get metadata with it. But the availability of the bibliographic API can still be a significant benefit. (By the way, re: metadata — ask them to give you the complete .json record, not just the marc-record part of the json.)

For small numbers of texts, you could in fact get the text itself from the data API. But this is not recommended for a big collection. Instead you’re going to want to write Hathi and request that they construct a dataset for you, based on facets that would be available in their Advanced Search feature. Once they build it — which could take a few weeks to a month — you can send them a hard drive or download data through rsync. (I initially found rsync perplexing, but after the nice people at Hathi gave me precise instructions, it was easy.) Using rsync through my campus office connection, it took about two days to transfer 800,000 volumes, which consumed a little less than 1TB of disk space. It would have been slower if I had tried to do it at home through commercial broadband and an AirPort.

There is a lot of time involved simply in moving data around, and in part I’m writing this post to warn people about that. One really basic point that took me a while to figure out: do not try to unzip the files. Part of the reason why it’s slow to move a large collection is that separate files require your i/o to do a lot of starting and stopping. That’s hard enough with (say) 500,000 separate zipped document folders. If you unzip those documents and get 165 million separate page files, it becomes very hard indeed. I actually spent more than a week unzipping the collection, and about a week trying to move it from one drive to another — only to get a disk error halfway through the process that required reformat.

Mothers, teach your children not to do as I have done. Just use the Python module zipfiles that works directly with the .zip file. It takes Python a few tenths of a second to extract the data, but it’s much better than trying to move 165 million individual pages. H/t to Loretta Auvil, by the way, for convincing me that this was simpler.

I’m going to try to make available the Python scripts and lexica that Mike and I design for working with the collection. There are

a) Simple logistical issues, like navigating the pairtree folder structure where files are stored and extracting them from .zip.
b) Metadata issues, like normalizing dates of publication that can be “1871” or “[18–]”.
c) Document-format issues, like running headers and page numbers.
d) OCR issues, which are the really fun ones as far as I’m concerned.

We’ve written pieces of all of this, and (a) through (c) are working, but it’s not yet in beta (to put it mildly). However, if you’re grappling with a similar problem, drop me a line and I’ll send you our code, such as it is. Development of this code was supported by the Andrew W. Mellon Foundation.

I’d also like to encourage everyone who’s interested in these kinds of problems to attend the HathiTrust Research Center UnCamp in Indiana this September (pre-register by August 1). This should be particularly useful if you’re interested in working on collections after 1923. HTRC has begun to design an infrastructure that will permit non-consumptive or non-expressive research on texts without transmitting the text itself to the researcher — obviously a crucial part of the solution to the problem of research on copyrighted text. They hope to demo parts of that infrastructure in September — but if you show up, you also have a fair chance of getting input on the design of the final version.

DH as a social phenomenon undigitized humanities

I like “open.” And I like “review.” But do they need to be fused?

Post author By tedunderwood
Post date June 21, 2012
6 Comments on I like “open.” And I like “review.” But do they need to be fused?

There’s been a lot of discussion this week about “open review.” Kathleen Fitzpatrick and Avi Santo have drafted a thoughtful and polished white paper on the topic, with input from many other hands. Alex Reid also commented insightfully.

FedericaMarchi — “Open Doors,” Federica Marchi 2007, CC BY-NC-ND 2.0.

Though my assessment of print scholarship is not as dark as Alex’s, I do share a bit of his puzzlement. To me, the concept of “open review” sometimes feels like an attempt to fit a round peg in a square hole.

I’m completely convinced about the value of the open intellectual exchange that happens on academic blogs. I’m constantly learning from other people’s blogs, and from their comments on mine. I’ve been warned away from dead ends, my methodology has improved, I’ve learned about sources I would otherwise have overlooked. It’s everything that’s supposed to happen at a conference — but rarely does. And you don’t have to pay for a plane ticket.

This kind of exchange is “open,” and it has intellectual value. On the other hand, I have no desire to claim that it constitutes a “review” process. It’s better than review: it’s learning. I don’t feel that I need to get credit for it on my vita, because the stuff I learn is going to produce articles … which I can then list on my vita.

As far as those articles are concerned, I’m more or less happy with existing review structures. I don’t (usually) learn as much from the formal review process as I do from blogs, but I’m okay with that: I can live with the fact that “review” is about selection and validation rather than open dialogue. (Also, note “usually” above: there are exceptions, when I get a really good reader/editor.)

To say the same thing more briefly: I think the Journal of Digital Humanities has the model about right. Articles in JDH tend to begin life as blog posts. They tend to get kicked around pretty vigorously by commenters: that’s the “open” part of the process, where most of the constructive criticism, learning, and improvement take place. Then they’re selected by the editors of JDH, which to my mind is “review.” The editors may not have to give detailed suggestions for revision, because the give-and-take of the blog stage has probably already shown the author where she wants to expand or rethink. The two stages (“open” and “review”) are loosely related, but not fused. As I understand the process, selection is partly (but only partly) driven by the amount of discussion a post stirred up.

If you ask, why not fuse the the two stages? I would say, because they’re doing different sorts of work. I think open intellectual exchange is most fun when it feels like a reciprocal exchange of views rather than a process where “I’m asking you to review my work.” So I’d rather not force it to count as a review process. Conversely, I suspect there are good reasons for the editorial selection process to be less than perfectly open. Majorities should rule in politics, but perhaps not always in academic debate.

But if people want to keep trying to fuse the “open” part with the “review” part, I’ve got no objection. It’s worth a try, and trying does no harm.

collection-building

It’s the data: a plan of action.

Post author By tedunderwood
Post date May 13, 2012
9 Comments on It’s the data: a plan of action.

I’m still fairly new at this gig, so take the following with a grain of salt. But the more I explore the text-mining side of DH, the more I wonder whether we need to rethink our priorities.

Over the last ten years we’ve been putting a lot of effort into building tools and cyberinfrastructure. And that’s been valuable: projects like MONK and Voyant play a crucial role in teaching people what’s possible. (I learned a lot from them myself.) But when I look around for specific results produced by text-mining, I tend to find that they come in practice from fairly simple, ad-hoc tools, applied to large datasets.

Ben Schmidt’s blog Sapping Attention is a good source of examples. Ben has discovered several patterns that really have the potential to change disciplines. For instance, he’s mapped the distribution of gender in nineteenth-century collections, and assessed the role of generational succession in vocabulary change. To do this, he hasn’t needed natural language processing, or TEI, or even topic modeling. He tends to rely on fairly straightforward kinds of corpus comparison. The leverage he’s getting comes ultimately from his decision to go ahead and build corpora as broad as possible using existing OCR.

I think that’s the direction to go right now. Moreover, before 1923 it doesn’t require any special agreement with publishers. There’s a lot of decent OCR in the public domain, because libraries can now produce cleaner copy than Google used to. Yes, some cleanup is still needed: running headers need to be removed, and the OCR needs to be corrected in period-sensitive ways. But it’s easier than people think to do that reliably. (You get a lot of clues, for instance, from cleaning up a whole collection at once. That way, the frequency of a particular form across the collection can help your corrector decide whether it’s an OCR error or a proper noun.)

In short, I think we should be putting a bit more collective effort into data preparation. Moreover, it seems to me that there’s a discernible sweet spot between vast collections of unreliable OCR and small collections of carefully-groomed TEI. What we need are collections in the 5,000 – 500,000 volume range, cleaned up to at least (say) 95% recall and 99% precision. Precision is more important than recall, because false negatives drop out of many kinds of analysis — as long as they’re randomly distributed (i.e. you can’t just ignore the f/s problem in the 18c). Collections of that kind are going to generate insights that we can’t glimpse as individual readers. They’ll be especially valuable once we enrich the metadata with information about (for instance) genre, gender, and nationality. I’m not confident that we can crowdsource OCR correction (it’s an awful lot of work), but I am confident that we could crowdsource some light enrichment of metadata.

So this is less a manifesto than a plan of action. I don’t think we need a center or a grant for this kind of thing: all we need is a coalition of the willing. I’ve asked HathiTrust for English-language OCR in the 18th and 19th centuries; once I get it, I’ll clean it up and make the cleaned version publicly available (as far as legally possible, which I think is pretty far). Then I’ll invite researchers to crowdsource metadata in some fairly low-tech way, and share the enriched metadata with everyone who participated in the crowdsourcing.

I would eagerly welcome suggestions about the kinds of metadata we ought to be recording (for instance, the genre categories we ought to use). Questions about selection/representativeness are probably better handled by individual researchers; I don’t think it’s possible to define a collective standard on that point, because people have different goals. Instead, I’ll simply take everything I can get, measure OCR quality, and allow people to define their own selection criteria. Researchers who want to produce a specific balance between X and Y can always do that by selecting a subset of the collection, or by combining it with another collection of their own.

Uncategorized

OT: politics, social media, and the academy.

Post author By tedunderwood
Post date May 6, 2012
19 Comments on OT: politics, social media, and the academy.

This is a blog about text mining, but from time to time I’m going to allow myself to wander off topic, briefly. At the moment, I think social media are adding a few interesting twists to an old question about the relationship between academic politics and politics-politics.

wisconsin-protests — Protests outside the Wisconsin State Capitol, Feb 2011.

It’s perhaps never a good idea to confuse politics with communicative rationality. Recently, in the United States, it isn’t even clear that all parties share a minimal respect for democratic norms. One side is willing to obstruct the right to vote, to lie about scientifically ascertainable fact, and to convert US attorneys (when they’re in power) into partisan enforcers. In circumstances like this, observers of good faith don’t need to spend a lot of time “debating” politics, because the other side isn’t debating. The only thing worth debating is how to fight back. And in a fight, dispassionate self-criticism becomes less important than solidarity.

Personally, I don’t mind a good fight with clearly-drawn moral lines. But this same clarity can be a bad thing for the academy. Dispassionate debate is what our institution is designed to achieve. If contemporary political life teaches us that “debate” is usually a sham, staged to produce an illusion of equivalence between A) fact and B) bullshit — then we may start to lose faith in our own guiding principles.

This opens up a whole range of questions. But maybe the most interesting question for likely readers of this blog will involve the role of social media. I think the web has proven itself a good tool for grassroots push-back against corporate power; we’re all familiar with successful campaigns against SOPA and Susan G. Komen. But social media also work by harnessing the power of groupthink. “Click like.” “Share.” “Retweet.” This doesn’t bother me where politics itself is concerned; political life always entails a decision to “hang together or hang separately.”

But I’m uneasy about extending the same strategy to academic politics, because our main job, in the academy, is debate rather than solidarity. I hesitate to use Naomi Schaefer Riley’s recent blog post to the Chronicle as an example, because it’s not in any sense a model of the virtues of debate. It was a hastily-tossed-off, sneering attack on junior scholars that failed to engage in any depth with the texts it attacked. Still, I’m uncomfortable when I see academics harnessing the power of social media to discourage The Chronicle from publishing Riley.

There was, after all, an idea buried underneath Riley’s sneers. It could have been phrased as a question about the role of politics in the humanities. Political content has become more central to humanistic research, at the same time as actual political debate has become less likely (for reasons sketched above). The result is that a lot of dissertations do seem to be proceeding toward a predetermined conclusion. This isn’t by any means a problem only in Black Studies, and Riley’s reasons for picking on Black Studies probably won’t bear close examination.

Still, I’m not persuaded that we would improve the academy by closing publications like the Chronicle to Riley’s kind of critique. Attacks on academic institutions can raise valid questions, even when they are poorly argued, sneering, and unfair. (E.g., I wouldn’t be writing this blog post if it weren’t for the outcry over Riley.) So in the end I agree with Liz McMillen’s refusal to take down the post.

But this particular incident is not of great significance. I want to raise a more general question about the role that technologies of solidarity should play in academic politics. We’ve become understandably cynical about the ideal of “open debate” in politics and journalism. How cynical are we willing to become about its place in academia? It’s a question that may become especially salient if we move toward more public forms of review. Would we be comfortable, for instance, with a petition directed at a particular scholarly journal, urging them not to publish article(s) by a particular author?

[11 p.m. May 6th: This post was revised after initial publication, mainly for brevity. I also made the final paragraph a little more pointed.]

[Update May 7: The Chronicle has asked Schaefer Riley to leave the blog. It’s a justifiable decision, since she wrote a very poorly argued post. But it also does convince me that social media are acquiring a new power to shape the limits of academic debate. That’s a development worth watching.]

[Update May 8th: Kevin Drum at Mother Jones weighs in on the issue.]

collection-building

The obvious thing we’re lacking.

Post author By tedunderwood
Post date April 26, 2012
6 Comments on The obvious thing we’re lacking.

I love Karen Coyle’s idea that we should make OCR usable by identifying the best-available copy of each text. It’s time to start thinking about this kind of thing. Digital humanists have been making big claims about our ability to interpret large collections. But — outside of a few exemplary projects like EEBO-TCP — we really don’t have free access to the kind of large, coherent collections that our rhetoric would imply. We’ve got feet of clay on this issue.

12-mrs-eaves-just-ligatures — Do you think the 'ct' ligature looks a little like an ampersand? Well, so do OCR engines.

Moreover, this wouldn’t be a difficult problem to address. I think it can be even simpler than Coyle suggests. In many cases, libraries have digitized multiple copies of a single edition. The obvious, simple thing to do is just:

Measure OCR quality — automatically, using a language model rather than ground truth — and associate a measurement of OCR quality with each bibliographic record.

This simple metric would save researchers a huge amount of labor, because a scholar could use an API to request “all the works you have between 1790 and 1820 that are above 90% probable accuracy” or “the best available copy of each edition in this period,” making it much easier to build a meaningfully normalized corpus. (This may be slightly different from Coyle’s idea about “urtexts,” because I’m talking about identifying the best copy of an edition rather than the best edition of a title.) And of course a metric destroys nothing: if you want to talk about print culture without filtering out poor OCR, all that metadata is still available. All this would do is empower researchers to make their own decisions.

One could go even further, and construct a “Frankenstein” edition by taking the best version of each page in a given edition. Or one could improve OCR with post-processing. But I think those choices can be left to individual research projects and repositories. The only part of this that really does need to be a collective enterprise is an initial measurement of OCR quality that gets associated with each bibliographic record and exposed to the API. That measurement would save research assistants thousands of hours of labor picking “the cleanest version of X.” I think it’s the most obvious thing we’re lacking.

[Postscript: Obviously, researchers can do this for themselves by downloading everything in period X, measuring OCR quality, and then selecting copies accordingly. In fact, I’m getting ready to build that workflow this summer. But this is going to take time and consume a lot of disk space, and it’s really the kind of thing an API ought to be doing for us.]

DH as a social phenomenon undigitized humanities

Why DH has no future.

Digital humanities is about eleven years old — counting from John Unsworth’s coinage of the phrase in 2001 — which perhaps explains why it has just discovered mortality and is anxiously contemplating its own.

tomstone — Creative commons BY-NC-SA 1.0: bigadventures.

Steve Ramsay tried to head off this crisis by advising digital humanists that a healthy community “doesn’t concern itself at all with the idea that it will one day be supplanted by something else.” This was ethically wise, but about as effective as curing the hiccups by not thinking about elephants. Words like “supplant” have a way of sticking in your memory. Alex Reid then gave the discussion a twist by linking the future of DH to the uncertain future of the humanities themselves.

Meanwhile, I keep hearing friends speculate that the phrase “digital humanities” will soon become meaningless, since “everything will be digital,” and the adjective will be emptied out.

In thinking about these eschatological questions, I start from Matthew Kirschenbaum’s observation that DH is not a single intellectual project but a tactical coalition. Just for starters, humanists can be interested in digital technology a) as a way to transform scholarly communication, b) as an object of study, or c) as a means of analysis. These are distinct intellectual projects, although they happen to overlap socially right now because they all require skills and avocations that are not yet common among humanists.

This observation makes it pretty clear how “the digital humanities” will die. The project will fall apart as soon as it’s large enough for falling apart to be an option.

A) Transforming scholarly communication. This is one part of the project where I agree that “soon everyone will be a digital humanist.” The momentum of change here is clear, and there’s no reason why it shouldn’t be generalized to academia as a whole. As it does generalize, it will no longer be seen as DH.

B) Digital objects of study. It’s much less clear to me that all humanists are going to start thinking about the computational dimension of new cultural forms (videogames, recommendation algorithms, and so on). Here I would predict the classic sort of slow battle that literary modernism, for instance, had to wage in order to be accepted in the curriculum. The computational dimension of culture is going to become increasingly important, but it can’t simply displace the rest of cultural history, and not all humanists will want to acquire the algorithmic literacy required to critique it. So we could be looking at a permanent tension here, whether it ends up being a division within or between disciplines.

C) Digital means of analysis. The part of the project closest to my heart also has the murkiest future. If you forced me to speculate, I would guess that projects like text mining and digital history may remain somewhat marginal in departments of literature and history. I’m confident that we’ll build a few tools that get widely adopted by humanists; topic modeling, for instance, may become a standard way to explore large digital collections. But I’m not confident that the development of new analytical strategies will ever be seen as a central form of humanistic activity. The disciplinary loyalties of people in this subfield may also be complicated by the relatively richer funding opportunities in neighboring disciplines (like computer science).

So DH has no future, in the long run, because the three parts of DH probably confront very different kinds of future. One will be generalized; one will likely settle in for trench warfare; and one may well get absorbed by informatics. [Or become a permanent trade mission to informatics. See also Matthew Wilkens’ suggestion in the comments below. – Ed.] But personally, I’m in no rush to see any of this happen. My odds of finding a disciplinary home in the humanities will be highest as long as the DH coalition holds together; so here’s a toast to long life and happiness. We are, after all, only eleven.

[Update: In an earlier version of this post I dated Schreibman, Siemens, and Unsworth’s Companion to Digital Humanities to 2001, as Wikipedia does. But it appears that 2004 is the earliest publication date.]

[Update April 15th: I find that people are receiving this as a depressing post. But I truly didn’t mean it that way. I was trying to suggest that the projects currently grouped together as “DH” can transform the academy in a wide range of ways — ways that don’t even have to be confined to “the humanities.” So I’m predicting the death of DH only in an Obi-Wan Kenobi sense! I blame that picture of a drowned tombstone for making this seem darker than it is — a little too evocative …]

19c Bayesian topic modeling discovery strategies topic modeling

Topic modeling made just simple enough.

Post author By tedunderwood
Post date April 7, 2012
86 Comments on Topic modeling made just simple enough.

Right now, humanists often have to take topic modeling on faith. There are several good posts out there that introduce the principle of the thing (by Matt Jockers, for instance, and Scott Weingart). But it’s a long step up from those posts to the computer-science articles that explain “Latent Dirichlet Allocation” mathematically. My goal in this post is to provide a bridge between those two levels of difficulty.

Computer scientists make LDA seem complicated because they care about proving that their algorithms work. And the proof is indeed brain-squashingly hard. But the practice of topic modeling makes good sense on its own, without proof, and does not require you to spend even a second thinking about “Dirichlet distributions.” When the math is approached in a practical way, I think humanists will find it easy, intuitive, and empowering. This post focuses on LDA as shorthand for a broader family of “probabilistic” techniques. I’m going to ask how they work, what they’re for, and what their limits are.

How does it work? Say we’ve got a collection of documents, and we want to identify underlying “topics” that organize the collection. Assume that each document contains a mixture of different topics. Let’s also assume that a “topic” can be understood as a collection of words that have different probabilities of appearance in passages discussing the topic. One topic might contain many occurrences of “organize,” “committee,” “direct,” and “lead.” Another might contain a lot of “mercury” and “arsenic,” with a few occurrences of “lead.” (Most of the occurrences of “lead” in this second topic, incidentally, are nouns instead of verbs; part of the value of LDA will be that it implicitly sorts out the different contexts/meanings of a written symbol.)

Of course, we can’t directly observe topics; in reality all we have are documents. Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them. (The notion that documents are produced by discourses rather than authors is alien to common sense, but not alien to literary theory.) Unfortunately, there is no way to infer the topics exactly: there are too many unknowns. But pretend for a moment that we had the problem mostly solved. Suppose we knew which topic produced every word in the collection, except for this one word in document D. The word happens to be “lead,” which we’ll call word type W. How are we going to decide whether this occurrence of W belongs to topic Z?

We can’t know for sure. But one way to guess is to consider two questions. A) How often does “lead” appear in topic Z elsewhere? If “lead” often occurs in discussions of Z, then this instance of “lead” might belong to Z as well. But a word can be common in more than one topic. And we don’t want to assign “lead” to a topic about leadership if this document is mostly about heavy metal contamination. So we also need to consider B) How common is topic Z in the rest of this document?

Here’s what we’ll do. For each possible topic Z, we’ll multiply the frequency of this word type W in Z by the number of other words in document D that already belong to Z. The result will represent the probability that this word came from Z. Here’s the actual formula:

Simple enough. Okay, yes, there are a few Greek letters scattered in there, but they aren’t terribly important. They’re called “hyperparameters” — stop right there! I see you reaching to close that browser tab! — but you can also think of them simply as fudge factors. There’s some chance that this word belongs to topic Z even if it is nowhere else associated with Z; the fudge factors keep that possibility open. The overall emphasis on probability in this technique, of course, is why it’s called probabilistic topic modeling.

Now, suppose that instead of having the problem mostly solved, we had only a wild guess which word belonged to which topic. We could still use the strategy outlined above to improve our guess, by making it more internally consistent. We could go through the collection, word by word, and reassign each word to a topic, guided by the formula above. As we do that, a) words will gradually become more common in topics where they are already common. And also, b) topics will become more common in documents where they are already common. Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. So the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.

What is it for? Topic modeling gives us a way to infer the latent structure behind a collection of documents. In principle, it could work at any scale, but I tend to think human beings are already pretty good at inferring the latent structure in (say) a single writer’s oeuvre. I suspect this technique becomes more useful as we move toward a scale that is too large to fit into human memory.

So far, most of the humanists who have explored topic modeling have been historians, and I suspect that historians and literary scholars will use this technique differently. Generally, historians have tried to assign a single label to each topic. So in mining the Richmond Daily Dispatch, Robert K. Nelson looks at a topic with words like “hundred,” “cotton,” “year,” “dollars,” and “money,” and identifies it as TRADE — plausibly enough. Then he can graph the frequency of the topic as it varies over the print run of the newspaper.

As a literary scholar, I find that I learn more from ambiguous topics than I do from straightforwardly semantic ones. When I run into a topic like “sea,” “ship,” “boat,” “shore,” “vessel,” “water,” I shrug. Yes, some books discuss sea travel more than others do. But I’m more interested in topics like this:

You can tell by looking at the list of words that this is poetry, and plotting the volumes where the topic is prominent confirms the guess.

This topic is prominent in volumes of poetry from 1815 to 1835, especially in poetry by women, including Felicia Hemans, Letitia Landon, and Caroline Norton. Lord Byron is also well represented. It’s not really a “topic,” of course, because these words aren’t linked by a single referent. Rather it’s a discourse or a kind of poetic rhetoric. In part it seems predictably Romantic (“deep bright wild eye”), but less colorful function words like “where” and “when” may reveal just as much about the rhetoric that binds this topic together.

A topic like this one is hard to interpret. But for a literary scholar, that’s a plus. I want this technique to point me toward something I don’t yet understand, and I almost never find that the results are too ambiguous to be useful. The problematic topics are the intuitive ones — the ones that are clearly about war, or seafaring, or trade. I can’t do much with those.

Now, I have to admit that there’s a bit of fine-tuning required up front, before I start getting “meaningfully ambiguous” results. In particular, a standard list of stopwords is rarely adequate. For instance, in topic-modeling fiction I find it useful to get rid of at least the most common personal pronouns, because otherwise the difference between 1st and 3rd person point-of-view becomes a dominant signal that crowds out other interesting phenomena. Personal names also need to be weeded out; otherwise you discover strong, boring connections between every book with a character named “Richard.” This sort of thing is very much a critical judgment call; it’s not a science.

I should also admit that, when you’re modeling fiction, the “author” signal can be very strong. I frequently discover topics that are dominated by a single author, and clearly reflect her unique idiom. This could be a feature or a bug, depending on your interests; I tend to view it as a bug, but I find that the author signal does diffuse more or less automatically as the collection expands.

What are the limits of probabilistic topic modeling? I spent a long time resisting the allure of LDA, because it seemed like a fragile and unnecessarily complicated technique. But I have convinced myself that it’s both effective and less complex than I thought. (Matt Jockers, Travis Brown, Neil Fraistat, and Scott Weingart also deserve credit for convincing me to try it.)

This isn’t to say that we need to use probabilistic techniques for everything we do. LDA and its relatives are valuable exploratory methods, but I’m not sure how much value they will have as evidence. For one thing, they require you to make a series of judgment calls that deeply shape the results you get (from choosing stopwords, to the number of topics produced, to the scope of the collection). The resulting model ends up being tailored in difficult-to-explain ways by a researcher’s preferences. Simpler techniques, like corpus comparison, can answer a question more transparently and persuasively, if the question is already well-formed. (In this sense, I think Ben Schmidt is right to feel that topic modeling wouldn’t be particularly useful for the kinds of comparative questions he likes to pose.)

Moreover, probabilistic techniques have an unholy thirst for memory and processing time. You have to create several different variables for every single word in the corpus. The models I’ve been running, with roughly 2,000 volumes, are getting near the edge of what can be done on an average desktop machine, and commonly take a day. To go any further with this, I’m going to have to beg for computing time. That’s not a problem for me here at Urbana-Champaign (you may recall that we invented HAL), but it will become a problem for humanists at other kinds of institutions.

Probabilistic methods are also less robust than, say, vector-space methods. When I started running LDA, I immediately discovered noise in my collection that had not previously been a problem. Running headers at the tops of pages, in particular, left traces: until I took out those headers, topics were suspiciously sensitive to the titles of volumes. But LDA is sensitive to noise, after all, because it is sensitive to everything else! On the whole, if you’re just fishing for interesting patterns in a large collection of documents, I think probabilistic techniques are the way to go.

Where to go next
The standard implementation of LDA is the one in MALLET. I haven’t used it yet, because I wanted to build my own version, to make sure I understood everything clearly. But MALLET is better. If you want a few examples of complete topic models on collections of 18/19c volumes, I’ve put some models, with R scripts to load them, in my github folder.

If you want to understand the technique more deeply, the first thing to do is to read up on Bayesian statistics. In this post, I gloss over the Bayesian underpinnings of LDA because I think the implementation (using a strategy called Gibbs sampling, which is actually what I described above!) is intuitive enough without them. And this might be all you need! I doubt most humanists will need to go further. But if you do want to tinker with the algorithm, you’ll need to understand Bayesian probability.

David Blei invented LDA, and writes well, so if you want to understand why this technique has “Dirichlet” in its name, his works are the next things to read. I recommend his Introduction to Probabilistic Topic Models. It recently came out in Communications of the ACM, but I think you get a more readable version by going to his publication page (link above) and clicking the pdf link at the top of the page.

Probably the next place to go is “Rethinking LDA: Why Priors Matter,” a really thoughtful article by Hanna Wallach, David Mimno, and Andrew McCallum that explains the “hyperparameters” I glossed over in a more principled way.

Then there are a whole family of techniques related to LDA — Topics Over Time, Dynamic Topic Modeling, Hierarchical LDA, Pachinko Allocation — that one can explore rapidly enough by searching the web. In general, it’s a good idea to approach these skeptically. They all promise to do more than LDA does, but they also add additional assumptions to the model, and humanists are going to need to reflect carefully about which assumptions we actually want to make. I do think humanists will want to modify the LDA algorithm, but it’s probably something we’re going to have to do for ourselves; I’m not convinced that computer scientists understand our problems well enough to do this kind of fine-tuning.

19c Bayesian topic modeling methodology poetic diction topic modeling Uncategorized

What kinds of “topics” does topic modeling actually produce?

Post author By tedunderwood
Post date April 1, 2012
17 Comments on What kinds of “topics” does topic modeling actually produce?

I’m having an interesting discussion with Lisa Rhody about the significance of topic modeling at different scales that I’d like to follow up with some examples.

I’ve been doing topic modeling on collections of eighteenth- and nineteenth-century volumes, using volumes themselves as the “documents” being modeled. Lisa has been pursuing topic modeling on a collection of poems, using individual poems as the documents being modeled.

The math we’re using is probably similar. I believe Lisa is using MALLET. I’m using a version of Latent Dirichlet Allocation that I wrote in Java so I could tinker with it.

But the interesting question we’re exploring is this: How does the meaning of LDA change when it’s applied to writing at different scales of granularity? Lisa’s documents (poems) are a typical size for LDA: this technique is often applied to identify topics in newspaper articles, for instance. This is a scale that seems roughly in keeping with the meaning of the word “topic.” We often assume that the topic of written discourse changes from paragraph to paragraph, “topic sentence” to “topic sentence.”

By contrast, I’m using documents (volumes) that are much larger than a paragraph, so how is it possible to produce topics as narrowly defined as this one?

This is based on a generically diverse collection of 1,782 19c volumes, not all of which are plotted here (only the volumes where the topic is most prominent are plotted; the gray line represents an aggregate frequency including unplotted volumes). The most prominent words in this topic are “mother, little, child, children, old, father, poor, boy, young, family.” It’s clearly a topic about familial relationships, and more specifically about parent-child relationships. But there aren’t a whole lot of books in my collection specifically about parent-child relationships! True, the most prominent books in the topic are A. F. Chamberlain’s The Child and Childhood in Folk Thought (1896) and Alice Earl Morse’s Child Life in Colonial Days (1899), but most of the rest of the prominent volumes are novels — by, for instance, Catharine Sedgwick, William Thackeray, Louisa May Alcott, and so on. Since few novels are exclusively about parent-child relations, how can the differences between novels help LDA identify this topic?

The answer is that the LDA algorithm doesn’t demand anything remotely like a one-to-one relationship between documents and topics. LDA uses the differences between documents to distinguish topics — but not by establishing a one-to-one mapping. On the contrary, every document contains a bit of every topic, although it contains them in different proportions. The numerical variation of topic proportions between documents provides a kind of mathematical leverage that distinguishes topics from each other.

The implication of this is that your documents can be considerably larger than the kind of granularity you’re trying to model. As long as the documents are small enough that the proportions between topics vary significantly from one document to the next, you’ll get the leverage you need to discriminate those topics. Thus you can model a collection of volumes and get topics that are not mere “subject classifications” for volumes.

Now, in the comments to an earlier post I also said that I thought “topic” was not always the right word to use for the categories that are produced by topic modeling. I suggested that “discourse” might be better, because topics are not always unified semantically. This is a place where Lisa starts to question my methodology a little, and I don’t blame her for doing so; I’m making a claim that runs against the grain of a lot of existing discussion about “topic modeling.” The computer scientists who invented this technique certainly thought they were designing it to identify semantically coherent “topics.” If I’m not doing that, then, frankly, am I using it right? Let’s consider this example:

This is based on the same generically diverse 19c collection. The most prominent words are “love, life, soul, world, god, death, things, heart, men, man, us, earth.” Now, I would not call that a semantically coherent topic. There is some religious language in there, but it’s not about religion as such. “Love” and “heart” are mixed in there; so are “men” and “man,” “world” and “earth.” It’s clearly a kind of poetic diction (as you can tell from the color of the little circles), and one that increases in prominence as the nineteenth century goes on. But you would be hard pressed to identify this topic with a single concept.

Does that mean topic modeling isn’t working well here? Does it mean that I should fix the system so that it would produce topics that are easier to label with a single concept? Or does it mean that LDA is telling me something interesting about Victorian poetry — something that might be roughly outlined as an emergent discourse of “spiritual earnestness” and “self-conscious simplicity”? It’s an open question, but I lean toward the latter alternative. (By the way, the writers most prominently involved here include Christina Rossetti, Algernon Swinburne, and both Brownings.)

In an earlier comment I implied that the choice between “semantic” topics and “discourses” might be aligned with topic modeling at different scales, but I’m not really sure that’s true. I’m sure that the document size we choose does affect the level of granularity we’re modeling, but I’m not sure how radically it affects it. (I believe Matt Jockers has done some systematic work on that question, but I’ll also be interested to see the results Lisa gets when she models differences between poems.)

I actually suspect that the topics identified by LDA probably always have the character of “discourses.” They are, technically, “kinds of language that tend to occur in the same discursive contexts.” But a “kind of language” may or may not really be a “topic.” I suspect you’re always going to get things like “art hath thy thou,” which are better called a “register” or a “sociolect” than they are a “topic.” For me, this is not a problem to be fixed. After all, if I really want to identify topics, I can open a thesaurus. The great thing about topic modeling is that it maps the actual discursive contours of a collection, which may or may not line up with “concepts” any writer ever consciously held in mind.

Computer scientists don’t understand the technique that way.* But on this point, I think we literary scholars have something to teach them.

On the collective course blog for English 581 I have some other examples of topics produced at a volume level.

*[UPDATE April 3, 2012: Allen Riddell rightly points out in the comments below that Blei’s original LDA article is elegantly agnostic about the significance of the “topics” — which are at bottom just “latent variables.” The word “topic” may be misleading, but computer scientists themselves are often quite careful about interpretation.]

Documentation / open data:
I’ve put the topic model I used to produce these visualizations on github. It’s in the subfolder 19th150topics under folder BrowseLDA. Each folder contains an R script that you run; it then prompts you to load the data files included in the same folder, and allows you to browse around in the topic model, visualizing each topic as you go.

I have also pushed my Java code for LDA up to github. But really, most people are better off with MALLET, which is infinitely faster and has hyperparameter optimization that I haven’t added yet. I wrote this just so that I would be able to see all the moving parts and understand how they worked.

18c 19c Bayesian topic modeling fiction Romantic-era writing

A touching detail produced by LDA …

Post author By tedunderwood
Post date March 25, 2012
7 Comments on A touching detail produced by LDA …

I’m getting ahead of myself with this post, because I don’t have time to explain everything I did to produce this. But it was just too striking not to share.

Basically, I’m experimenting with Latent Dirichlet Allocation, and I’m impressed. So first of all, thanks to Matt Jockers, Travis Brown, Neil Fraistat, and everyone else who tried to convince me that Bayesian methods are better. I’ve got to admit it. They are.

But anyway, in a class I’m teaching we’re using LDA on a generically diverse collection of 1,853 volumes between 1751 and 1903. The collection includes fiction, poetry, drama, and a limited amount of nonfiction (just biography). We’re stumbling on a lot of fascinating things, but this was slightly moving. Here’s the graph for one particular topic.

The circles and X’s are individual volumes. Blue is fiction, green is drama, pinkish purple is poetry, black biography. Only the volumes where this topic turned out to be prominent are plotted, because if you plot all 1,853 it’s just a blurry line at the bottom of the image. The gray line is an aggregate frequency curve, which is not related in any very intelligible way to the y-axis. (Work in progress …) As you can see. this topic is mostly prominent in fiction around the year 1800. Here are the top 50 words in the topic:

But here’s what I find slightly moving. The x’s at the top of the graph are the 10 works in the collection where the topic was most prominent. They include, in order, Mary Wollstonecraft Shelley, Frankenstein, Mary Wollstonecraft, Mary, William Godwin, St. Leon, Mary Wollstonecraft Shelley, Lodore, William Godwin, Fleetwood, William Godwin, Mandeville, and Mary Wollstonecraft Shelley, Falkner.

In short, this topic is exemplified by a family! Mary Hays does intrude into the family circle with Memoirs of Emma Courtney, but otherwise, it’s Mary Wollstonecraft, William Godwin, and their daughter.

Other critics have of course noticed that M. W. Shelley writes “Godwinian novels.” And if you go further down the list of works, the picture becomes less familial (Helen Maria Williams and Thomas Holcroft butt in, as well as P. B. Shelley). Plus, there’s another topic in the model (“myself these should situation”) that links William Godwin more closely to Charles Brockden Brown than it does to his wife or daughter. And LDA isn’t graven on stone; every time you run topic modeling you’re going to get something slightly different. But still, this is kind of a cool one. “Mind feelings heart felt” indeed.

18c 19c genre comparison poetic diction

Etymology and nineteenth-century poetic diction; or, singing the shadow of the bitter old sea.

Post author By tedunderwood
Post date March 9, 2012
4 Comments on Etymology and nineteenth-century poetic diction; or, singing the shadow of the bitter old sea.

In a couple of recent posts, I argued that fiction and poetry became less similar to nonfiction prose over the period 1700-1900. But because I only measured genres’ distance from each other, I couldn’t say much substantively about the direction of change. Toward the end of the second post, though, I did include a graph that hinted at a possible cause:

The older part of the lexicon (mostly words derived from Old English) gradually became more common in poetry, fiction, and drama than in nonfiction prose. This may not be the only reason for growing differentiation between literary and nonliterary language, but it seems worth exploring. (I should note that function words are excluded from this calculation for reasons explained below; we’re talking about verbs, nouns, and adjectives — not about a rising frequency of “the.”)

Why would genres become etymologically different? Well, it appears that words of different origins are associated in contemporary English with different registers (varieties of language appropriate for a particular social situation). Words of Old English provenance get used more often in speech than in writing — and in writing they are (now) used more often in narrative than in exposition. Moreover, writers learn to produce this distinction as they get older; there isn’t a marked difference for students in elementary school. But as they advance to high school, students learn to use Latinate words in formal expository writing (Bar-Ilan and Berman, 2007).

It’s not hard to see why words of Old English origin might be associated with spoken language. English was for 200 years (1066-1250) almost exclusively spoken. The learned part of the Old English lexicon didn’t survive this period. Instead, when English began to be used again in writing, literate vocabulary was borrowed from French and Latin. As a result, etymological distinctions in English tend also to be distinctions between different social contexts of language use.

Instead of distinguishing “Germanic” and “Latinate” diction here, I have used the first attested date for each word, choosing 1150 as a dividing line because it’s the midpoint of the period when English was not used in writing. Of course pre-1150 words are mostly from Old English, but I prefer to divide based on date-of-entry because that highlights the history of writing rather than a spurious ethnic mystique. (E.g., “Anglo-Saxon is a livelier tongue than Latin, so use Anglo-Saxon words.” — E. B. White.) But the difference isn’t material. You could even just measure the average length of words in different genres and get results that are close to the results I’m graphing here (the correlation between the pre/post-1150 ratio and average word length is often -.85 or lower).

The bottom line is this: using fewer pre-1150 words tends to make diction more overtly literate or learned. Using more of them makes diction less overtly learned, and perhaps closer to speech. It would be dangerous to assume much more: people may think that Old English words are “concrete” — but this isn’t true, for instance, of “word” or “true.”

What can we learn by graphing this aspect of diction?

In the period 1700-1900, I think we learn three interesting things:

Poetry and fiction reverse that process in the nineteenth century, and develop a diction that is markedly less learned than other kinds of writing — or than their own past history.

But they do that to different degrees, and as a result the overall story is one of increasing differentiation — not just between “literary” and “nonliterary” diction — but between poetry and fiction as well.

I’m fascinated by this picture. It suggests that the difference linguists have observed between the registers of exposition and narrative may be a relatively recent development. It also raises interesting questions about “literariness” in the eighteenth and nineteenth centuries. For instance, contrast this picture to the standard story where “poetic diction” is an eighteenth-century refinement that the nineteenth century learns to dispense with. Where the etymological dimension of diction is concerned, that story doesn’t fit the evidence. On the contrary, nineteenth-century poetry differentiates itself from the diction of prose in a new and radical way: by the end of the century, the older part of the lexicon has become more than 2.5 times more prominent, on average, in verse than it is in nonfiction prose.

I could speculate about why this happened, but I don’t really know yet. What I can do is give a little more descriptive detail. For instance, if pre-1150 words became more common in 19c poetry … which words, exactly, were involved? One way to approach that is to ask which individual words correlate most strongly with the pre/post-1150 ratio. We might focus especially, for instance, on the rising trend in poetry from the middle of the eighteenth century to 1900. If you sort the top 10,000 words in the poetry collection by correlation with yearly values of the pre/post ratio, you get a list like this:

But the precise correlation coefficients don’t matter as much as an overall picture of diction, so I’ll simply list the hundred words that correlate most strongly with the pre/post-1150 ratio in poetry from 1755 to 1900:

We’re looking mostly at a list of pre-1150 words, with a few exceptions (“face,” “flower,” “surely”). That’s not an inevitable result; if the etymological trend had been a side-effect of something mostly unrelated to linguistic register (say, a vogue for devotional poetry), then sorting the top 10,000 words by correlation with the trend would reveal a list of words associated with its underlying (religious) cause. But instead we’re seeing a trend that seems to have a coherent sociolinguistic character. That’s not just a feature of the top 100 words: the average pre-1150 word is located 2210 places higher on this list than the average post-1150 word.

It’s not, however, simply a list of common Anglo-Saxon words. The list clearly reflects a particular model of “poetic diction,” although the nature of that model is not easy to describe. It involves an odd mixture of nouns for large natural phenomena (wind, sea, rain, water, moon, sun, star, stars, sunset, sunrise, dawn, morning, days, night, nights) and verbs that express a subjective relation (sang, laughed, dreamed, seeing, kiss, kissed, heard, looked, loving, stricken). [Afterthought: I don’t think we have any Hopkins in our collection, but it sounds like my computer is parodying Gerard Manley Hopkins.] There’s also a bit of explicitly archaic Wardour Street in there (yea, nay, wherein, thereon, fro).

Here, by contrast, are the words at the bottom of the list — the ones that correlate negatively with the pre/post-1150 trend, because they are less common, on average, in years where that trend spikes.

There’s a lot that could be said about this list, but one thing that leaps out is an emphasis on social competition. Pomp, power, superior, powers, boast, bestow, applause, grandeur, taste, pride, refined, rival, fortune, display, genius, merit, talents. This is the language of poems that are not bashful about acknowledging the relationship between “social” distinction and the “arts” “inspired” by the “muse” — a theme entirely missing (or at any rate disavowed) in the other list. So we’re getting a fairly clear picture of a thematic transformation in the concerns of poetry from 1755 to 1900. But these lists are generated strictly by correlation with the unsmoothed year-to-year variation of an etymological trend! Moreover, the lists have themselves an etymological character. There are admittedly a few pre-1150 words in this list of negative correlators (mind, oft, every), but by and large it’s a list of words derived from French or Latin after 1150.

I think the apparent connection between sociolinguistic and thematic issues is the really interesting part of this. It begins to hint that the broader shift in poetic diction (using words from the older part of the lexicon to differentiate poetry from prose) had itself an unacknowledged social rationale — which was to disavow poetry’s connection to cultural distinction, and foreground instead a simplified individual subjectivity. I’m admittedly speculating a little here, and there’s a great deal more that could be said both about poetry and about parallel trends in fiction — but I’ve said enough for one blog post.

A couple of quick final notes. You’re wondering, what about drama?

Our collection of drama in the nineteenth century is actually too sparse to draw any conclusions yet, but there’s the trend line so far, if you’re interested.

You’re also wondering, how were pre- and post-1150 words actually sorted out? I made a list of the 10,500 most common words in the collection, and mined etymologies for them using a web-crawler on Dictionary.com. I excluded proper nouns, abbreviations, and words that entered English after 1699. I also excluded function words (determiners, prepositions, conjunctions, pronouns, and the verb to be) because as Bar-Ilan and Berman say, “register variation is essentially a matter of choice — of selecting high-level more formal alternatives instead of everyday, colloquial items or vice versa” (15). There is generally no alternative to prepositions, pronouns, etc, so they don’t tell us much about choice. After those exclusions, I had a list of 9,517 words, of which 2,212 entered the language before 1150 and 7,125 after 1149. (The list is available here.)

Finally, I doubt we’ll be so lucky — but if you do cite this blog post, it should be cited as a collective work by Ted Underwood and Jordan Sellers, because the nineteenth-century part of the underlying collection is a product of Jordan’s research.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

[UPDATE March 13, 2011: Twitter conversation about this post with Natalia Cecire.]