Open data

I’m going to use this page to store links to data and code that support certain blog posts.

Literary and nonliterary diction, the sequel.

Here’s very ugly R code that I used in the post. A prettier, commented version will be forthcoming over the weekend.
RepeatSamplingOfCorpus.R

Data and code for “differentiation of literary diction.”

The eighteenth-century part of the collection I’m using was drawn mostly from TCP-ECCO. The nineteenth-century part was developed by Jordan Sellers over the course of the last year. Jordan selected the volumes from WorldCat that seemed to have sold most widely in each decade (based on the number of early editions surviving in libraries). He also made a special effort to represent the diversity of genres and subject categories in each decade. Then we ran those volumes through a Python script that corrected their OCR, using specially developed dictionaries and rulesets sensitive to nineteenth-century usage. Finally, we tokenized the volumes using a ruleset that normalized orthography and hyphenation consistently across the whole period 1700-1900. [Update March 3, 2012: I should also acknowledge that Jordan's work was funded by the Andrew W. Mellon Foundation.]

Jordan and TCP certainly have different selection practices, and it is possible that this could produce some differences between the 18c and the 19c parts of the collection. But I have trouble imagining how it would produce a shift in the relationship of fiction to nonfiction that begins in the 18c and continues through the 19c.

If you’re looking for an immaculate collection, you want a best-of album by Madonna, and not this dataset. In doing OCR correction we aimed at 95% accuracy, emphasizing “precision” over “recall.” I think we achieved much better than 95% precision, and the data is great for the kind of statistical questions that interest me — but if you’re curating editions for posterity, this won’t be any use to you. The precision differs between the TCP collection and the part we constructed, which matters for absolute frequency measurements — but again, it’s hard to see how that would affect the relative similarity of genres.

The basic metadata for the collection is here:
DocMetadata.txt: Collection metadata
Start here if you want to know what texts I’m basing this on.

Here is the collection itself as
JointCorpus.txt: A sparse table of word frequencies.
It’s a zipped document of 108MB. When you unzip it, it will be about 1GB. It’s a tab-separated file with seven columns: segment, docid, word, # of occurrences for that word, year, genre, collection. The “docid” column is keyed to the metadata file I have provided above. Unfortunately I have divided long documents into “segments” of 50-100k words for certain purposes. So if you want to turn this into a list of word frequencies by document you will need to sum the # of occurrences by docid and word.
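The summing step described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the R used for the post. It assumes the tab-separated fields appear in the order listed (segment, docid, word, occurrences, year, genre, collection) and that the file has no header row. The function name `sum_by_document` and the sample rows, including the genre and collection codes, are made up for the example.

```python
import csv
import io
from collections import Counter


def sum_by_document(lines):
    """Collapse segment-level rows into per-document word counts.

    Assumes tab-separated rows in the order segment, docid, word,
    occurrences, year, genre, collection, with no header row
    (an assumption, not something the post confirms).
    """
    totals = Counter()
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        docid, word, count = row[1], row[2], int(row[3])
        totals[(docid, word)] += count
    return totals


# Tiny made-up sample: doc1 is split into two segments, doc2 has one.
sample = io.StringIO(
    "0\tdoc1\tthe\t5\t1750\tfic\ttcp\n"
    "1\tdoc1\tthe\t3\t1750\tfic\ttcp\n"
    "0\tdoc2\tthe\t2\t1820\tpoe\tsellers\n"
)
totals = sum_by_document(sample)
print(totals[("doc1", "the")])  # 8
```

For the real file you would pass an open file handle instead of the `io.StringIO` sample. With roughly 1GB of input, the resulting table may be large, so filtering by year or genre inside the loop may be worthwhile.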

I can also provide a zipped version of individual text files if you want to develop your own system for tokenizing and normalizing them. If you’re interested, I can also share the rulesets I use to do that.

For the purposes of this blog post, I also produced a version of all poetry files with the prose introductions and notes taken out. That makes a difference, especially in the late 18c/early 19c where poetry volumes have voluminous prose notes and are rather similar to prose unless you select verse only. I haven’t yet integrated these in the main collection; they represent “alternate versions” of the poetry documents that were used in this post, and the docid in each case is distinguished by a “v” on the end.
VerseOnly.txt: Verse-only versions of poetry.
This is a tab-separated file with the same format that I used in JointCorpus.txt above.
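If you want to substitute the verse-only versions for the original poetry documents, something like the following sketch might work. It assumes that stripping the trailing “v” from a verse-only docid recovers the docid of the original document, which is how the description above reads but is not guaranteed. The function name `swap_in_verse_only` and the docids in the sample are hypothetical.

```python
def swap_in_verse_only(main_counts, verse_counts):
    """Return a copy of main_counts in which each document that has a
    verse-only alternate is replaced by that alternate.

    Both arguments map docid -> {word: count}. Assumes a verse-only docid
    is the original docid plus a trailing "v" (an assumption based on the
    description of the file, not a documented rule).
    """
    merged = dict(main_counts)
    for verse_id, words in verse_counts.items():
        if not verse_id.endswith("v"):
            continue  # not an alternate version under this assumption
        original_id = verse_id[:-1]
        merged.pop(original_id, None)  # drop the version with prose notes
        merged[verse_id] = words
    return merged


# Made-up docids and counts for illustration only.
main_counts = {
    "poem12": {"the": 40, "note": 9},
    "novel7": {"the": 90},
}
verse_counts = {"poem12v": {"the": 25}}

merged = swap_in_verse_only(main_counts, verse_counts)
print(sorted(merged))  # ['novel7', 'poem12v']
```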

Broadly the way I work is to store data in MySQL as a sparse table, and unpack parts of it into R as needed to answer specific questions. Here is the R script I used to identify words that seemed to be responsible for the differentiation of genres:

DiachronicCorpusComparison.R
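The general pattern described above (keep the data as a sparse table in a database, then unpack a slice into a dense table when a question calls for it) can be sketched like this. It is not the code used for the post: Python's built-in `sqlite3` stands in for MySQL so the example is self-contained, and the table name `corpus`, its columns, and the function `unpack` are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus (docid TEXT, word TEXT, occurrences INTEGER, genre TEXT)"
)
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?, ?, ?)",
    [  # made-up rows for illustration
        ("doc1", "the", 8, "fic"),
        ("doc1", "heart", 2, "fic"),
        ("doc2", "the", 5, "non"),
    ],
)


def unpack(conn, genre, vocabulary):
    """Build a dense docid -> [count per vocabulary word] table for one genre."""
    column = {word: i for i, word in enumerate(vocabulary)}
    placeholders = ",".join("?" for _ in vocabulary)
    query = (
        "SELECT docid, word, SUM(occurrences) FROM corpus "
        f"WHERE genre = ? AND word IN ({placeholders}) "
        "GROUP BY docid, word"
    )
    dense = {}
    for docid, word, total in conn.execute(query, [genre, *vocabulary]):
        # Words absent from a document keep their default count of 0.
        row = dense.setdefault(docid, [0] * len(vocabulary))
        row[column[word]] = total
    return dense


fiction = unpack(conn, "fic", ["the", "heart", "said"])
print(fiction)  # {'doc1': [8, 2, 0]}
```

The sparse form only stores words that actually occur, so the dense rows are built on demand and only for the vocabulary a given question needs.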

More code to come.

12 thoughts on “Open data”

  1. Let me evangelize on this here because I only fully drank the Kool-Aid after several years of using the language, and it makes everything (particularly ggplot2) so much nicer: code in R is far better vectorized, which in practice means never using a for-loop, a counter variable, or pre-filling an array before populating it. So for instance:

    Corpus1 <- array(data = 0, dim = c(Size1, WordsLen))
    i = 0
    for (Member in Members1) {
    i = i + 1
    if (Member %in% OldMembers1) {
    Index <- which(OldMembers1 == Member)
    Corpus1[i, ] <- OldCorpus1[Index, ]
    }
    }

    could be (with NAs instead of 0s, which can be changed):

    Corpus1 = sapply(Members1, function(Member) {
    OldCorpus1[which(OldMembers1 == Member), ]
    })

    Or even better, using match():

    Corpus1 = OldCorpus1[match(Members1,OldMembers1),]

    (not tested)

    • Thanks, Ben. That “match” function is nice! I know the line about R wanting to vectorize everything, I just often forget how to use sapply to solve a specific problem. But I appreciate the evangelism and promise to be born again!

  2. I’m really interested in this work and appreciate your openness about your data & process. Could you say a bit more about the selection process for the 19thc volumes? Was the genre selection (and determination, which isn’t always as clear-cut as one might imagine) done before or after selecting those texts most represented in libraries?
    Also, how were you measuring “the number of early editions surviving in libraries”? That is, do 4 copies of Book X at Library 1 equal 1 copy of Book X at each of Libraries 2, 3, 4, and 5, or are those weighted differently? (Intuitively I’d say that books with one pattern and books with the other are placed differently in the canon, at least for 19thc poetry, but I don’t know what the data would show.)

    • You are so right about genre determination being a mess. Especially in the 18c, where books are often actively trying to trick you. If they say they’re letters — they’re not letters. If they say they’re “a true history” — it’s fiction.

      Jordan and I did genre determination by hand. In the 18c part, I tagged things with genre IDs after TCP-ECCO had selected them. In the 19c part, Jordan tried to select things to keep genres more or less in balance. So genre determination was done before selection there. But we do also have a “miscellaneous” category that allows us to include texts that are indeterminate, either in the 18c or the 19c.

      But my feeling is that the problem here is less selection criteria than the actual fuzziness of generic categories. In the 18c it is often very hard to say what’s fiction, what’s a fictionalized biography, and what’s a biography. This is in one sense a problem for my analysis, but in another sense a piece of corroborating evidence. It’s not surprising that generic diction hadn’t differentiated yet when the genres themselves were hardly differentiated!

    • To flesh out some of the “fine points” of the collection process, I probably need to overview the two 19c archives at work in the background. At present, the 19c corpus is a subset of a broader archive of roughly 4000 individual volumes. So there are two levels of selection guiding the collection development: one for archiving and one for processing.

      Archival Phase:

      Genre diversity was our base target for the broader archive, so I tried a few different methods for obtaining the wide survey. At first, I experimented with Library of Congress Classification browsing on a year by year basis in WorldCat’s advanced search. For example, an 1800 “year” parameter and an LCC descriptor will return the top titles for any given LCC. That method produced some fascinating titles, but in practice, it was a serious time drain. WorldCat can produce roughly the same results if you define minimal parameters. I defined publisher location and year, which returns a book titles list for “everything” sorted by “worldwide libraries” who own at least one copy. WorldCat’s sorting algorithm does not take into account libraries that have multiple copies, so “the number of early editions surviving in libraries” is really just a total of libraries known to have a surviving original.

      WorldCat seems to be the best resource available to generate a sorted list of titles while simultaneously returning all possible hits. On the downside, WorldCat disproportionately represents US libraries. According to WorldCat’s registry there are 6013 member libraries in the UK: 1222 are academic libraries, 11 are state/national libraries, and 4208 are public libraries. I am a bit unclear whether that means all the libraries are actively contributing records. My guess is that a percentage of those libraries have holdings listed. In comparison, there are a whopping 78132 US libraries listed in the registry as members. Of those, 9894 are listed as academic libraries, 186 are state/national libraries, and 18487 are public libraries. There is the US bias by the numbers. WorldCat makes that information available here (using advanced search):
      http://www.worldcat.org/registry/Institutions.

      Not all libraries in the US or UK are included in those numbers, so you figure that you lose a certain part of your sample there. WorldCat is not a particularly accurate marker of best selling books in the year X, but it provides numbers that are hard to find efficiently without going to publisher lists or finding print run info. The kinds of volumes that I am collecting also present a problem in terms of record entry that makes it very hard to quantify the number of extant copies. There are plenty of non-inventoried collections of early 19c books in archives across the US. Some of those collections have not yet made the move to digital records that WorldCat would catch. I also ran into some problems with the catalog records that have been entered online. Many academic libraries have relatively clean online records. But you can count on numerous wildcards: duplicate records at libraries, microfilm that has been entered as a book, and so on. All that noted, WorldCat does give quick, efficient results.

      In my experience, the main utility of the “worldwide libraries” number is that you get the bounty of asking for all titles, regardless of genre, plus some sorting. There are other resources available that do similar work to WorldCat. A good example is the Karlsruhe Virtual Catalog: http://www.ubka.uni-karlsruhe.de/kvk_en.html. This engine is run out of Germany and can combine WorldCat results with British Libraries. The downside to Karlsruhe is you lose the ability to sort based on number of libraries or number of worldwide copies.

      For each year 1800-1899, I surveyed the top book titles returned for London, Edinburgh, Dublin, Boston, New York, and Philadelphia. At this point, you also have to deal with the problem of balancing national origin: UK vs. US. On the UK side, I selected every available title in the top 50 results for London, the top 10 results for Edinburgh, and the top 5 for Dublin. For the American titles, I usually selected every available title in the top 10 for Philadelphia, Boston, and New York respectively. Those ratios are an arbitrary rubric on my part. However, the “worldwide libraries” numbers for London, New York, Boston, and Philadelphia tend to level out beyond that point, which makes it much harder to balance the archive.

      Processing Phase:

      For the smaller 19c corpus of corrected texts, I selected roughly 10 titles per year (100 titles per decade) to process with the OCR corrector. This second selection happened one decade at a time. If you select the 100 titles with the most worldwide libraries in each decade, nonfiction will overrun the sample. With that in mind, the smaller 19c corpus strikes a balance between the diverse genre survey and trends that emerge in the WorldCat results. For example, if the number of travel literature titles spiked in the results for a decade, I tried to mirror that trend in the sub-collection. As Ted suggested, the rubric for deciding which texts to correct first fluctuates based on decade-to-decade trends in the WorldCat results. The long-term goal is to vacuum up everything left in the broader archive.

  3. Thanks, Jordan, for explaining in more detail about your process. I’m currently using WorldCat as one of my sources as I collect bibliographic metadata for English poetry published 1840-1900, so I’m always interested in how other people are doing related kinds of work. I’m very interested in the results you and Ted are coming up with but also in the methodological and practical implications of the process.
    Your clarification about how you’re using the numbers of libraries holding a particular edition (as opposed to copies of editions in libraries) is an important one. And this approach makes sense for the big-picture, multi-genre kind of corpus you’re collecting. More detailed comparisons would only be possible within a smaller set of records (and might be something I’ll look into a bit in the data I’ve been gathering). Overall, I think it’s important to stress that the number of libraries holding a text doesn’t necessarily tell us which texts sold most widely (something Ted implied), given the vagaries of library acquisitions and research canon formation. These numbers tell us about the 19thc textual record as it is constituted in modern-day library holdings. (Which, as Andrew Stauffer’s been demonstrating, is under distinct pressures that may in time change our view of that textual record.)

    • Re: the imperfection of WorldCat numbers: Absolutely.

      It would be tremendously valuable if people with a background in library science would put together a broad list of, say, the 5,000 titles that actually *did* reach the broadest readership in Britain (and/or America) in the 18c, 19c, and so on. Lists like that would get used heavily. It sounds like you’re contributing toward that goal.

  4. Pingback: Giving It Away « historying

  5. I know this isn’t the right place on your blog to inquire about this, but do you have an RSS feed or any other means (aside from WordPress email subscription or twitter) for someone to follow this blog?

    • It’s a good question. I don’t use RSS myself yet (for no reason other than being slow to get started), so I’ve been lazy about figuring out how it would interact with my blog. But I’m finally making an effort, and I *think* I’ve just enabled an RSS feed in the sidebar. Let me know if it’s not working optimally; I may need to keep fiddling.

      Thanks for asking,

      Ted

  6. Pingback: The Emergence of Literary Diction « archaeoinaction.info
