Tech note

I’m using this as a temporary holding area for technical details that would otherwise clutter up the explanations on the main blog. That way they’re available for people who are interested, without boring people who aren’t.

The corpus-level topic modeling (“topic tree”) I described on April 4th was written in R, using a vector space model of a peculiar kind that I’m not sure people have used before, which is what makes it worth describing at some length.

Generally, vector space models are used in search engines, to find a document that matches your search query. Each document is represented by a vector; the frequencies of different terms in the document are the components of the vector.

I’m flipping this upside down, so I can use a vector space model to map the relationships between terms rather than between documents. In my model, each term is represented by a vector, and the components of the vector are its frequency in each of the documents in the corpus. For topic modeling across a corpus, this works better than simple Pearson’s correlation, because the “cosine similarity” measure used in a vector space model automatically gives more weight to longer documents, and to documents where a term is very strongly represented.
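To make that flip concrete, here is a minimal sketch in R with made-up counts (the numbers and the little cosine function are just for illustration; the definition of “frequency” gets refined below):

# Rows are terms, columns are documents; each row is a term's vector.
# The counts here are invented, not drawn from ECCO.
term_doc <- matrix(
  c(12,  3,  0,  7,    # "ancient"
     9,  1,  0,  5,    # "antiquity"
     2, 14, 11,  3),   # "did"
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ancient", "antiquity", "did"),
                  c("doc1", "doc2", "doc3", "doc4"))
)

# Cosine similarity: dot product divided by the product of vector lengths.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(term_doc["ancient", ], term_doc["antiquity", ])  # high: similar profiles
cosine(term_doc["ancient", ], term_doc["did", ])        # lower: different profiles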

One additional refinement is necessary to make the model really hum. There are a lot of different ways to define “frequency” in a vector space model. Search engines use tf-idf scores, which work well for search, since you want to give extra weight to rare words when you’re trying to find documents. I find that it’s better not to give that extra weight when you’re interested in the words themselves.

But there are also problems with using raw counts of occurrence, or normalized occurrences-per-thousand-words, as your measure of “frequency” in a vector space model. What I finally settled on, both because it makes sense statistically and because, in practice, it works like a dream, is assessing frequency as the difference between the actual number of occurrences of a term in a document and its expected occurrence there. In other words, if I’ve got 1000 documents of the same length, and term X occurs 1000 times in this corpus, I would normally expect it to occur once in each document. Expected occurrence is

(document length / corpus length) * total occurrences of the term

So the frequency of term X in document Y is defined as:

occurrences of X in Y - ((length of Y / corpus length) * total occurrences of X in corpus)

This means that some components of the vector are negative, which is actually important. Otherwise the fact that a word doesn’t occur in a book of 100,000 words would have the same weight as the fact that it doesn’t occur in a play of 15,000 words, because they would both “bottom out” at zero.
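Here is a minimal sketch of that calculation in R, with made-up counts and document lengths rather than the actual ECCO data:

# Raw counts of a couple of terms (rows) in three documents (columns).
counts <- matrix(
  c(20, 1, 4,
     0, 9, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ancient", "did"), c("TomJones", "Pamela", "play"))
)
doc_lengths <- c(TomJones = 100000, Pamela = 80000, play = 15000)
corpus_length <- sum(doc_lengths)

# Expected occurrences: (document length / corpus length) * total occurrences.
expected <- outer(rowSums(counts), doc_lengths / corpus_length)

# The vector components are observed minus expected. They can be negative,
# so absence from a long book counts for more than absence from a short play.
deviation <- counts - expected
deviation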

When I say that this technique “works like a dream,” what I mean is that it produces a very useful measure of association. When you take a word like “ancient/antient” and use this vector space model to find words associated with it, you get a list like this:

word cosine-similarity
ancient 0.7611
antiquity 0.7560
monuments 0.6187
centuries 0.5939
modern 0.5505
origin 0.5449
century 0.5440
antient 0.5421
tradition 0.5363
ages 0.5348

Most of the other techniques I tried produced either a list of very rare words like “Diocletian” (true, it appears in the same books as “ancient,” but …) or a list of very common words like “did” (which also appears in the same books, but …).
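For what it’s worth, a list like the one above takes only a few lines of R once you have a term-by-document matrix of deviation scores. This is just a sketch on random toy data, not the actual pipeline:

# Toy deviation matrix: 50 "words" by 20 "documents".
set.seed(1)
dev <- matrix(rnorm(50 * 20), nrow = 50,
              dimnames = list(paste0("word", 1:50), paste0("doc", 1:20)))

# Row-normalize so that a dot product between rows is a cosine similarity.
unit <- dev / sqrt(rowSums(dev^2))

# Cosine similarity of every word to the query word, sorted in descending order.
# (The query word itself comes back at the top with similarity 1.)
query <- "word1"
sims <- unit %*% unit[query, ]
head(sort(sims[, 1], decreasing = TRUE), 10)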

Of course, the “topic trees” produced by this measure are only as good as the lists of words you feed into them. I started out using a selective list of 500 words, but I found that the results got more interesting when I expanded to the 1350, and then 1650, most common words in the ECCO collection. Why 1350 and 1650, rather than 1500? No good reason. It’s a computationally expensive process, so I’ve been expanding the size of the thing gradually to avoid crashes. [Editor’s note, June 29th: Actually, it wasn’t computationally expensive. I was just being hesitant back in April. This process turns out to be pretty robust and easy to scale up.] The illustrations in the April 4th post were produced at different stages of that expansion, so they can’t easily be compared to each other. The 1650-word tree I posted on April 7th is more coherent.

I excluded roughly 150 stopwords, especially pronouns, conjunctions, and prepositions, as well as most auxiliary verbs, abbreviations, and personal names. Those choices are worth close examination: given the way the clustering process works, initial conditions can end up making a big difference. But I also have to say that I’m struck by how stable the structure turns out to be: I’ve seen four versions of the tree now, based on very different initial sets of words, and much of the architecture remained constant in all four versions.
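For readers who don’t want to open the scripts linked in the comments below, the clustering step itself is short. Here is a sketch of the general approach in R, on toy data; the average-linkage choice and the cut into five branches are arbitrary illustrations rather than the settings used for the published trees:

# Toy deviation matrix: rows are words, columns are documents.
set.seed(2)
dev <- matrix(rnorm(30 * 15), nrow = 30,
              dimnames = list(paste0("word", 1:30), paste0("doc", 1:15)))

# Turn cosine similarity between every pair of words into a distance.
unit <- dev / sqrt(rowSums(dev^2))
cos_dist <- as.dist(1 - unit %*% t(unit))

# Hierarchical clustering produces the dendrogram (the "topic tree").
tree <- hclust(cos_dist, method = "average")
plot(tree, cex = 0.6)

# Chopping the tree down into a fixed number of branches.
branches <- cutree(tree, k = 5)
table(branches)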

It may be controversial whether or not to call this “topic modeling.” If you want to describe the internal structure of a literary work, this technique of course won’t do the job directly, because it doesn’t divide works into parts. But I think it does a pretty good job of identifying the implicit thematic structure of eighteenth-century discourse as a whole, and I wouldn’t be surprised if it turned out that the internal structure of individual works is defined in large part by the way these corpus-level topics weave in and out of them.

The underlying 18c texts are stored as frequency tables in MySQL. Spelling was normalized to modern British spelling, although a few things may still need correction. I didn’t change 18c syncope in words like “o’er,” but I did normalize the spelling of past tenses: “inspir’d” became “inspired.”
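The past-tense rule is simple enough to illustrate with a one-liner (toy examples here, not the actual preprocessing code, which lived in the Python scripts mentioned below):

# Normalize elided past tenses ("'d" -> "ed") while leaving syncope alone.
words <- c("inspir'd", "charm'd", "o'er", "ancient")
gsub("'d$", "ed", words)
# yields: "inspired" "charmed" "o'er" "ancient"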

6 replies on “Tech note”

This is really cool. I’m impressed by how neatly the high-level topic trees are separated from each other. Dropping TF-IDF seems like a good idea to me, too; it seems like humanists are often quite interested in words that appear in almost all documents but at varying rates, which TF-IDF tends to deprecate.

So you’re using a numeric count of over-occurrence (e.g., “ancient” appears in Tom Jones 20 fewer times than expected). The worry I’ve had with that is that more frequent words have higher scores, but as I write this I’m realizing that doesn’t matter here, because the vectors should still point in the same direction. I spent a while fooling around with an expectations model based on standard deviations to wipe out size effects, but this seems a lot cleaner. Very clever. I think I get it now.

Thanks! I’m glad you like it, because, as you know, the whole idea was based very heavily on your approach. If I understand the math correctly, words of greater (overall) frequency pose no problem for this model. But I think the cosine-similarity measure in a vector space model does have a tendency to give greater weight to *documents* that are outliers. I.e., I suspect that

“10 occurrences more than expected in Tom Jones +
10 occurrences more than expected in Pamela”

is not going to be weighted quite as heavily as

“20 occurrences more than expected in Tom Jones.”

But that differential weighting may be exactly what we want for these purposes.
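To put toy numbers on that, suppose (purely for illustration) that the query word’s own surplus is concentrated in Tom Jones:

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical deviations across just two documents: (Tom Jones, Pamela).
query   <- c(15, 5)   # query word's surplus, mostly in Tom Jones
spread  <- c(10, 10)  # 10 more than expected in each novel
bunched <- c(20, 0)   # 20 more than expected in Tom Jones alone

cosine(query, spread)   # ~0.89
cosine(query, bunched)  # ~0.95

Whether the bunched candidate wins does depend on where the query word’s own deviations sit; with an evenly spread query the ranking flips.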

Here are links to the R modules that actually produce the clustering and tree map. This doesn’t include the Python scripts I used to convert the ECCO documents into frequency tables in the first place. But if you’re working with a different dataset, that part wouldn’t be important:

This is the main clustering/dendrogram builder:

http://dl.dropbox.com/u/4713959/ECCOtree.R

Here is a trivial little script that I used to cut the giant dendrogram down into “branches”:
http://dl.dropbox.com/u/4713959/TreeChopper.R

Sorry that I’m not sharing this in a more elegant way. Don’t have a github account … yet.

Is there any way that I can use the results from both Pearson correlation and the vector space model to generate some recommended value or percentage, in other words hybrid results? Thanks.