People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?”
Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” to apply those methods to. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.
I suspect the core problem is that most of us learned language a long time ago, and have forgotten how much leverage it provides. We can still recognize that syntax might be worthy of analysis, because it’s elusive enough to be interesting. But the basic phenomenon of the “word” seems embarrassingly crude.
We need to remember that words are actually features of a very, very high-level kind. As a thought experiment, I find it useful to compare text mining to image processing. Take the picture on the right. It’s pretty hard to teach a computer to recognize that this is a picture that contains a face. To recognize that it contains “sitting” and a “baby” would be extraordinarily impressive. And it’s probably, at present, impossible to figure out that it contains a “blanket.”
Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”
Natural language processing is going to be important for all kinds of reasons — among them, it can eventually tell us which clauses are subjunctive (should we wish to know). But I think it’s a mistake to imagine that text mining is now in a sort of crude infancy, whose real possibilities will only be revealed after NLP matures. Wordcounts are amazing! An enormous amount of our cultural history is already tagged, in a detailed way that is also easy to analyze statistically. That’s not an embarrassingly babyish method: it’s a huge and obvious research opportunity.
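To make “counting words” concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than a description of any particular project: the function names `word_counts` and `relative_frequencies` are hypothetical, the sample sentence is made up, and the regular expression is a deliberately crude tokenizer that only handles lowercase ASCII letters and straight apostrophes.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, pull out word tokens, and count each one.

    The pattern is an assumption made for illustration: runs of a-z,
    optionally followed by an apostrophe and more letters ("it's").
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


def relative_frequencies(counts):
    """Divide each count by the total number of tokens, so that
    documents of different lengths can be compared."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# A made-up sample "document".
sample = "The baby is sitting on a blanket. The blanket is blue."

counts = word_counts(sample)
freqs = relative_frequencies(counts)

# Print the three most frequent words with their raw counts.
print(counts.most_common(3))
```

Dividing by the total is the only statistical step here, but it is the one that lets you compare a word’s prominence across volumes of different lengths. Once a collection has been reduced to tables like `freqs`, the usual statistical tools apply directly.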