I’m going to use this page to store links to data and code that support certain blog posts.
Literary and nonliterary diction, the sequel.
Here’s the very ugly R code I used in the post. A prettier, commented version will be forthcoming over the weekend.
Data and code for “differentiation of literary diction.”
The eighteenth-century part of the collection I’m using was drawn mostly from TCP-ECCO. The nineteenth-century part was developed by Jordan Sellers over the course of the last year. Jordan selected the volumes from WorldCat that seemed to have sold most widely in each decade (based on the number of early editions surviving in libraries). He also made a special effort to represent the diversity of genres and subject categories in each decade. Then we ran those volumes through a Python script that corrected their OCR, using specially developed dictionaries and rulesets sensitive to nineteenth-century usage. Finally, we tokenized the volumes using a ruleset that normalized orthography and hyphenation consistently across the whole period 1700-1900. [Update March 3, 2012: I should also acknowledge that Jordan's work was funded by the Andrew W. Mellon Foundation.]
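The actual correction and tokenizing rulesets are specific to this project, but a minimal sketch can suggest the kind of normalization involved. The two rules below (rejoining words hyphenated across line breaks, and converting the long s common in eighteenth-century printing) are illustrative assumptions, not the rulesets described above.

```python
import re

def normalize(text):
    # Rejoin words hyphenated across a line break: "senti-\nment" -> "sentiment"
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
    # Convert the long s (U+017F), common in 18c printing, to an ordinary "s"
    return text.replace('\u017f', 's')

def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes as tokens
    return re.findall(r"[a-z']+", normalize(text).lower())

print(tokenize("The sublimeſt senti-\nment"))
# → ['the', 'sublimest', 'sentiment']
```

A real ruleset for 1700-1900 would of course need many more rules (spelling variants, ligatures, period-specific OCR confusions); this only shows the shape of the approach.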
Jordan and TCP certainly have different selection practices, and it is possible that this could produce some differences between the 18c and the 19c parts of the collection. But I have trouble imagining how it would produce a shift in the relationship of fiction to nonfiction that begins in the 18c and continues through the 19c.
If you’re looking for an immaculate collection, you want a best-of album by Madonna, not this dataset. In doing OCR correction we aimed at 95% accuracy, emphasizing “precision” over “recall.” I think we achieved much better than 95% precision, and the data is great for the kind of statistical questions that interest me — but if you’re curating editions for posterity, this won’t be any use to you. The precision differs between the TCP collection and the part we constructed, which matters for absolute frequency measurements — but again, it’s hard to see how that would affect the relative similarity of genres.
The basic metadata for the collection is here:
DocMetadata.txt: Collection metadata
Start here if you want to know what texts I’m basing this on.
Here is the collection itself:
JointCorpus.txt: A sparse table of word frequencies.
It’s a zipped document of 108MB. When you unzip it, it will be about 1GB. It’s a tab-separated file with seven columns: segment, docid, word, number of occurrences of that word, year, genre, and collection. The “docid” column is keyed to the metadata file I have provided above. Unfortunately I have divided long documents into “segments” of 50-100k words for certain purposes. So if you want to turn this into a list of word frequencies by document, you will need to sum the number of occurrences by docid and word.
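Collapsing segments back into per-document counts is a few lines in any language. Here is one sketch in Python, assuming the tab-separated layout just described (no header row); the function name is mine, not part of the dataset.

```python
import csv
from collections import Counter

def doc_frequencies(path):
    """Sum per-segment occurrences into (docid, word) -> count."""
    counts = Counter()
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t'):
            if len(row) != 7:
                continue  # skip blank or malformed lines
            segment, docid, word, occurrences, year, genre, collection = row
            counts[(docid, word)] += int(occurrences)
    return counts
```

Applied to JointCorpus.txt, this yields word frequencies by document rather than by segment; filtering on the year or genre columns first would work the same way.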
I can also provide a zipped version of individual text files if you want to develop your own system for tokenizing and normalizing them. If you’re interested, I can also share the rulesets I use to do that.
For the purposes of this blog post, I also produced a version of all poetry files with the prose introductions and notes taken out. That makes a difference, especially in the late 18c/early 19c, where poetry volumes have voluminous prose notes and are rather similar to prose unless you select verse only. I haven’t yet integrated these into the main collection; they represent “alternate versions” of the poetry documents that were used in this post, and the docid in each case is distinguished by a “v” on the end.
VerseOnly.txt: Verse-only versions of poetry.
This is a tab-separated file in the same format as JointCorpus.txt above.
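If you want to combine the two files, the “v”-suffix convention makes the substitution mechanical. A sketch, with hypothetical docids for illustration:

```python
def prefer_verse(doc_ids, verse_ids):
    """Swap in the verse-only version of a document where one exists,
    using the convention that its docid is the original docid plus 'v'."""
    verse_set = set(verse_ids)
    return [d + 'v' if d + 'v' in verse_set else d for d in doc_ids]

# 'p001' has a verse-only alternate; 'n002' (nonfiction, say) does not
print(prefer_verse(['p001', 'n002'], ['p001v']))
# → ['p001v', 'n002']
```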
Broadly, the way I work is to store the data in MySQL as a sparse table, and unpack parts of it into R as needed to answer specific questions. Here is the R script I used to identify words that seemed to be responsible for the differentiation of genres:
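To make that workflow concrete, here is a small sketch using SQLite as a stand-in for MySQL. The table and column names are my own assumptions, modeled on the file format above rather than on the actual schema; the point is just that a sparse table lets you pull out one question’s worth of data with a single aggregate query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE wordcounts
    (segment TEXT, docid TEXT, word TEXT, occurrences INTEGER,
     year INTEGER, genre TEXT, collection TEXT)""")
conn.executemany("INSERT INTO wordcounts VALUES (?,?,?,?,?,?,?)", [
    ('s1', 'd1', 'the', 5, 1790, 'fiction', 'TCP'),
    ('s2', 'd1', 'the', 3, 1790, 'fiction', 'TCP'),
    ('s1', 'd2', 'the', 2, 1850, 'poetry', '19c'),
])

# Unpack one slice of the sparse table: per-document totals for one word
rows = conn.execute("""SELECT docid, SUM(occurrences)
                       FROM wordcounts
                       WHERE word = 'the'
                       GROUP BY docid ORDER BY docid""").fetchall()
print(rows)
# → [('d1', 8), ('d2', 2)]
```

The result of a query like this is what would then be loaded into R (e.g. as a data frame) for the actual statistical work.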
More code to come.