Wordcounts are amazing.

People new to text mining are often disillusioned when they figure out how it’s actually done, which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language, but they think, “Surely it’s something better than that?”

Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” to which we can apply them. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.

I suspect the core problem is that most of us learned language a long time ago, and have forgotten how much leverage it provides. We can still recognize that syntax might be worthy of analysis — because it’s elusive enough to be interesting. But the basic phenomenon of the “word” seems embarrassingly crude.

Billy Graham, 1949, from the Galt Museum, on Creative Commons.
Baby, 1949, from the Galt Museum, on Creative Commons.

We need to remember that words are actually features of a very, very high-level kind. As a thought experiment, I find it useful to compare text mining to image processing. Take the picture on the right. It’s pretty hard to teach a computer to recognize that this is a picture that contains a face. To recognize that it contains “sitting” and a “baby” would be extraordinarily impressive. And it’s probably, at present, impossible to figure out that it contains a “blanket.”

Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”

Natural language processing is going to be important for all kinds of reasons — among them, it can eventually tell us which clauses are subjunctive (should we wish to know). But I think it’s a mistake to imagine that text mining is now in a sort of crude infancy, whose real possibilities will only be revealed after NLP matures. Wordcounts are amazing! An enormous amount of our cultural history is already tagged, in a detailed way that is also easy to analyze statistically. That’s not an embarrassingly babyish method: it’s a huge and obvious research opportunity.

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

11 replies on “Wordcounts are amazing.”

Thanks for that link. I hadn’t seen that, but it’s exactly what I meant. And I was generally thinking about the problem of “deep learning,” which seems to involve successive layers of abstraction — first to produce high-level features, and then to learn patterns of which those features are elements. Working with text, a large part of that job has been done for us.

I’ve thought a lot about how people with various backgrounds think about language.

There was a significant shift in my thinking about topic analysis (which I didn’t think about in a serious way until your long PMLA piece) when I realized that it treated a document as a bag of words. That should be pretty obvious to anyone who’s actually done it, but it was nowhere explicitly stated in any of the “intro to topic analysis for humanists” articles that I read. I didn’t find an explicit statement about that until I read the technical review article by the guy at Princeton whose name I forget. The point, of course, is that the technique involves comparing the contents of hundreds and thousands of bags of words. It’s easy enough to state that, but how do you lead someone to think that through with some care?
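One way to think the "bag of words" idea through is to build a couple of bags by hand. The sketch below is only an illustration, not topic modeling itself: it shows the representation that such methods start from, and it compares two bags with cosine similarity, which is a stand-in chosen here for simplicity. The function names and the tokenizing rule (lowercase, letters and apostrophes only) are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count each word in the text, throwing away word order entirely."""
    # Assumed tokenizer: lowercase, then runs of letters and apostrophes.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Compare two bags by the angle between their count vectors."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[word] * bag_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Two sentences with different meanings but identical bags.
first = bag_of_words("The cat saw the dog")
second = bag_of_words("The dog saw the cat")
print(first == second)
print(cosine_similarity(first, second))
```

The two sentences say different things, yet their bags are equal, which is exactly what gets lost. What remains is a count vector per document, and that is what makes it cheap to compare thousands of documents at once.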

And I thought that Matthew Jockers’ very literary account was almost useless; the literary charm just got in the way.

Thinking about what texts mean, and how slippery meaning is: that’s what lit crit trains you to do. And it’s worlds away from thinking about the brute mechanisms and substance of language. When you’re thinking about meaning you’re always passing through or over the words on the way to meaning. You don’t really think about the language itself, not even if you’re a canny post-structuralist who takes pride in skepticism about language.

So, yes, by all means, word counts. How do words actually get into texts?

“Pennebaker admits that word-counting programs are “remarkably stupid,” unable to recognize irony, sarcasm or even the basic contextual clues that allow us to distinguish which meaning of a word is intended. Yet these “stupid” programs have led to a series of unexpected findings ever since Pennebaker first saw the need for one 20 years ago. At the time, he and his graduate students were working through thousands of diary entries written by people suffering from depression, analyzing how people deal with traumatic moments. Writing about trauma seemed to help some people, but why? To answer the question, his team created a program to read the diary entries automatically and count words related to different psychological states, like anger, sadness and more positive emotions.”
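A program of the kind described in that passage can be sketched in a few lines. This is not Pennebaker's actual program or dictionary: the category names follow the quotation, but the word lists in `LEXICON` are tiny, invented examples, and `category_rates` is a hypothetical helper written for this illustration.

```python
import re
from collections import Counter

# Invented, illustrative word lists; a real study would use a much
# larger, validated dictionary.
LEXICON = {
    "anger": {"angry", "furious", "hate", "rage"},
    "sadness": {"sad", "cry", "grief", "lonely"},
    "positive": {"happy", "hope", "love", "glad"},
}


def category_rates(text, lexicon=LEXICON):
    """Return, for each category, the share of words in the text that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in lexicon.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words)
    # Report a rate rather than a raw count so long and short entries compare fairly.
    return {
        category: (counts[category] / total if total else 0.0)
        for category in lexicon
    }


entry = "I was sad and angry but now I hope"
print(category_rates(entry))
```

The program is "stupid" in just the sense the quotation admits: it would count "hope" in "I have no hope" as positive. Its usefulness comes from running it over thousands of entries, where such errors tend to matter less than the overall pattern.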

[…] [1] Ted Underwood points out that while word counts are simplistic, they are still extremely powerful. The full richness of words themselves, he argues, is still not a fully utilized feature for machine learning algorithms. In the comments Ryan Shaw points to another blog post by Brendan O’Connor which succinctly and brilliantly observes: “Words are already a massive dimension reduction of the space of human experiences.” […]