
Distant reading and the blurry edges of genre.

There are basically two different ways to build collections for distant reading. You can build up collections of specific genres, selecting volumes that you know belong to them. Or you can take an entire digital library as your base collection, and subdivide it by genre.

Most people do it the first way, and having just spent two years learning to do it the second way, I’d like to admit that they’re right. There’s a lot of overhead involved in mining a library. The problem becomes too big for your desktop; you have to schedule batch jobs; you have to learn to interpret MARC records. All this may be necessary eventually, but it’s not the ideal place to start.

But some of the problems I’ve encountered have been interesting. In particular, the problem of “dividing a library by genre” has made me realize that literary studies is constituted by exclusions that are a bit larger and more arbitrary than I used to think.

First of all, why is dividing by genre even a problem? Well, most machine-readable catalog records don’t say much about genre, and even if they did, a single volume usually contains multiple genres anyway. (Think introductions, indexes, collected poems and plays, etc.) With support from the ACLS and NEH, I’ve spent the last year wrestling with that problem, and in a couple of weeks I’m going to share an imperfect page-level map of genre for English-language books in HathiTrust 1700-1923.

But the bigger thing I want to report is that the ambiguity of genre may run deeper than most scholars who aren’t librarians currently imagine. To be sure, we know that subgenres like “detective fiction” are social institutions rather than natural forms. And in a vague way we also accept that broader categories like “fiction” and “poetry” are social constructs with blurry edges. We can all point to a few anomalies: prose poems, eighteenth-century journalistic fictions like The Spectator, and so on.

But somehow, in spite of knowing this for twenty years, I never grasped the full scale of the problem. For instance, I knew the boundary between fiction and nonfiction was blurry in the 18c, but I thought it had stabilized over time. By the time you got to the Victorians, surely, you could draw a circle around “fiction.” Exceptions would just prove the rule.

Selecting volumes one by one for genre-specific collections didn’t shake my confidence. But if you start with a whole library and try to winnow it down, you’re forced to consider a lot of things you would otherwise never look at. I’ve become convinced that the subset of genre-typical cases (should we call them cis-genred volumes?) is nowhere near as paradigmatic as literary scholars like to imagine. A substantial proportion of the books in a library don’t fit those models.

This is both a photograph of a real, unnamed mother and baby, and a picture of a fictional character named Shinkah. Frontispiece to Shinkah, The Osage Indian (1916).

Consider the case of Shinkah, the Osage Indian, published in 1916 by S. M. Barrett. The preface to this volume informs us that it’s intended as a contribution to “the sociology of the Osage Indians.” But it’s set a hundred years in the past, and the central character Shinkah is entirely fictional (his name just means “child”). On the other hand, the book is illustrated with photographs of real contemporary people, who stand for the characters in an ethnotypical way.

After wading through 872,000 volumes, I’m sorry to report that odd cases of this kind are more typical of nineteenth- and early twentieth-century fiction than my graduate-school training had led me to believe. There’s a smooth continuum, for instance, between Shinkah and Old Court Life in France (1873), by Frances Elliot. This book has a bibliography, and a historiographical preface, but otherwise reads like a historical novel, complete with invented dialogue. I’m not sure how to distinguish it from other historical novels with real historical personages as characters.

Literary critics know there’s a problem with historical fiction. We also know about the blurry boundary between fiction, journalism, and travel writing represented by the genre of the “sketch.” And anyone who remembers James Frey being kicked out of Oprah Winfrey’s definition of nonfiction knows that autobiographies can be problematic. And we know that didactic fiction blurs into philosophical dialogue. And anyone who studies children’s literature knows that the boundary between fiction and nonfiction gets especially blurry there. And probably some of us know about ethnographic novels like Shinkah. But I’m not sure many of us (except for librarians) have added it all up. When you’re sorting through an entire library you’re forced to see the scale of it: in the period 1700-1923, maybe 10% of the volumes that could be cataloged as fiction present puzzling boundary cases.

You run into a lot of these works even if you browse or select titles at random; that’s how I met Shinkah. But I’ve also been training probabilistic models of genre that report, among other things, how certain or uncertain they are about each page. These models are good at identifying clear cases of our received categories; I found that they agreed with my research assistants almost exactly as often as the research assistants agreed with each other (93-94% of the time, about broad categories like fiction/nonfiction). But you can also ask a model to sift through several thousand volumes looking for hard cases. When I did that I was taken aback to discover that about half the volumes it had most trouble with were things I also found impossible to classify. The model was most uncertain, for instance, about The Terrific Register (1825) — an almanac that mixes historical anecdote, urban legend, and outright fiction randomly from page to page. The second-most puzzling book was Madagascar, or Robert Drury’s Journal (1729), a book that offers itself as a travel journal by a real person, and was for a long time accepted as one, although scholars have more recently argued that it was written by Defoe.
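
To make "looking for hard cases" concrete, here is a minimal sketch of one way to rank volumes by a model's uncertainty. It is not the code used in this project: the entropy measure, the function names, and the page-level probabilities are all assumptions invented for illustration.

```python
import math

def binary_entropy(p):
    """Uncertainty, in bits, of a predicted probability p that a page is fiction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def volume_uncertainty(page_probs):
    """Mean per-page entropy for one volume."""
    return sum(binary_entropy(p) for p in page_probs) / len(page_probs)

def hardest_volumes(volumes, n=2):
    """Return the ids of the n volumes the model is least sure about."""
    ranked = sorted(volumes,
                    key=lambda vid: volume_uncertainty(volumes[vid]),
                    reverse=True)
    return ranked[:n]

# Hypothetical page-level probabilities of "fiction" (made-up numbers).
volumes = {
    "clear_novel": [0.97, 0.99, 0.95, 0.98],
    "clear_history": [0.03, 0.02, 0.05, 0.01],
    "mixed_miscellany": [0.55, 0.40, 0.62, 0.48],
}

print(hardest_volumes(volumes, n=1))  # prints ['mixed_miscellany']
```

Averaging entropy over pages is only one possible choice. A volume that alternates between confidently fictional and confidently nonfictional pages would score low here, so a real search for mixed volumes would probably also want to look at how much the page-level predictions vary.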

Of course, a statistical model of fiction doesn’t care whether things “really happened”; it pays attention mostly to word frequency. Past-tense verbs of speech, personal names, and “the,” for instance, are disproportionately common in fiction. “Is” and “also” and “mr” (and a few hundred other words) are common in nonfiction. Human readers probably think about genre in a more abstract way. But it’s not particularly miraculous that a model using word frequencies should be confused by the same examples we find confusing. The model was trained, after all, on examples tagged by human beings; the whole point of doing that was to reproduce as much as possible the contours of the boundary that separates genres for us. The only thing that’s surprising is that trawling the model through a library turns up more books right in the middle of the boundary region than our habits of literary attention would have suggested.
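
As a rough illustration of how word frequencies alone can separate genres, here is a toy sketch in the style of a naive Bayes classifier. It is an assumption-laden stand-in, not the model described above: the training sentences are invented, the two classes are treated as equally likely in advance, and a real model would be trained on thousands of tagged pages.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def train(labeled_pages):
    """Count word occurrences separately for each genre label."""
    counts = {"fiction": Counter(), "nonfiction": Counter()}
    for text, label in labeled_pages:
        counts[label].update(tokenize(text))
    vocab = set(counts["fiction"]) | set(counts["nonfiction"])
    return counts, vocab

def log_odds_fiction(text, counts, vocab):
    """Sum per-word log-odds; positive leans fiction, negative leans nonfiction."""
    totals = {genre: sum(c.values()) for genre, c in counts.items()}
    score = 0.0
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Add-one smoothing keeps a word absent from one genre from giving zero.
        p_fic = (counts["fiction"][word] + 1) / (totals["fiction"] + len(vocab))
        p_non = (counts["nonfiction"][word] + 1) / (totals["nonfiction"] + len(vocab))
        score += math.log(p_fic / p_non)
    return score

# Invented training examples, echoing the word patterns mentioned above.
training = [
    ("she said the door was open", "fiction"),
    ("he replied the night was cold", "fiction"),
    ("the report is also brief", "nonfiction"),
    ("mr smith is also the author", "nonfiction"),
]
counts, vocab = train(training)

print(log_odds_fiction("she said he replied", counts, vocab) > 0)  # prints True
print(log_odds_fiction("mr smith is also", counts, vocab) < 0)     # prints True
```

A page whose words pull in both directions ends up with a score near zero, which is the word-frequency version of a boundary case.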

A lot of discussions of distant reading have imagined it as a move from canonical to popular or obscure examples of a (known) genre. But reconsidering our definitions of the genres we’re looking for may be just as important. We may come to recognize that “the novel” and “the lyric poem” have always been islands floating in a sea of other texts, widely read but never genre-typical enough to be replicated on English syllabi.

In the long run, this may require us to balance two kinds of inclusiveness. We already know that digital libraries exclude a lot. Allen Riddell has nicely demonstrated just how much: he concludes that there are digital scans for only about 58% of the novels listed in bibliographies as having been published between 1800 and 1836.

One way to ensure inclusion might be to start with those bibliographies, which highlight books invisible in digital libraries. On the other hand, bibliographies also make certain things invisible. The Terrific Register (1825), for instance, is not in Garside’s bibliography of early-nineteenth-century fiction. Neither is The Wonder-Working Water Mill (1791), to mention another odd thing I bumped into. These aren’t oversights; Garside et al. acknowledge that they’re excluding certain categories of fiction from their conception of the novel. But because we’re trained to think about novels, the scale of that exclusion may only become visible after you spend some time trawling a library catalog.

I don’t want to present this as an aporia that makes it impossible to know where to start. It’s not. Most people attempting distant reading are already starting in the right place — which is to build up medium-sized collections of familiar generic categories like “the novel.” The boundaries of those categories may be blurrier than we usually acknowledge. But there’s also such a thing as fretting excessively about the synchronic representativeness of your sample. A lot of the interesting questions in distant reading are actually trends that involve relative, diachronic differences in the collection. Subtle differences of synchronic coverage may more or less drop out of questions about change over time.

On the other hand, if I’m right that the gray areas between (for instance) fiction and nonfiction are bigger and more persistently blurry than literary scholarship usually mentions, that’s probably in the long run an issue we should consider! When I release a page-level map of genre in a couple of weeks, I’m going to try to provide some dials that allow researchers to make more explicit choices about degrees of inclusion or exclusion.

Predictive models that report probabilities give us a natural way to handle this, because they allow us to characterize every boundary as a gradient, and explicitly acknowledge our compromises (for instance, trade-offs between precision and recall). People who haven’t done much statistical modeling often imagine that numbers will give humanists spuriously clear definitions of fuzzy concepts. My experience has been the opposite: I think our received disciplinary practices often make categories seem self-evident and stable because they teach us to focus on easy cases. Attempting to model those categories explicitly, on a large scale, can force you to acknowledge the real instability of the boundaries involved.
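
The trade-off between precision and recall can be sketched with a few lines of code. This is a hypothetical illustration, not part of the released data or tools: the probabilities, the labels, and the threshold values are all made up.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall when every item with prob >= threshold counts as fiction."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Hypothetical model probabilities and human judgments (True = fiction).
probs = [0.95, 0.90, 0.70, 0.60, 0.45, 0.30, 0.10]
labels = [True, True, True, False, True, False, False]

print(precision_recall(probs, labels, 0.5))  # prints (0.75, 0.75)
print(precision_recall(probs, labels, 0.8))  # prints (1.0, 0.5)
```

Raising the threshold from 0.5 to 0.8 yields a collection with no false positives that misses half of the fiction. Neither setting is correct in the abstract; the threshold is a dial, and the right position depends on the question being asked.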

References and acknowledgments

Training data for this project was produced by Shawn Ballard, Jonathan Cheng, Lea Potter, Nicole Moore and Clara Mount, as well as me. Michael L. Black and Boris Capitanu built a GUI that helped us tag volumes at the page level. Material support was provided by the National Endowment for the Humanities and the American Council of Learned Societies. Some information about results and methods is online as a paper and a poster, but much more will be forthcoming in the next month or so — along with a page-level map of broad genre categories and types of paratext.

The project would have been impossible without help from HathiTrust and HathiTrust Research Center. I’ve also been taught to read MARC records by librarians and information scientists including Tim Cole, M. J. Han, Colleen Fallaw, and Jacob Jett, any of whom could teach a course on “Cursed Metadata in Theory and Practice.”

I mention Garside’s bibliography of early nineteenth-century fiction. This is Garside, Peter, and Rainer Schöwerling. The English Novel, 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles. Ed. Peter Garside, James Raven, and Rainer Schöwerling. 2 vols. Oxford: Oxford University Press, 2000.

Paul Fyfe directed me to a couple of useful works on the genre of the sketch. Michael Widner has recently written a dissertation about the cognitive dimension of genre titled Genre Trouble. I’ve also tuned into ongoing thoughts about the temporal and social dimensions of genre from Daniel Allington and Michael Witmore. The now-classic pamphlet #1 from the Stanford Literary Lab, “Quantitative Formalism,” is probably responsible for my interest in the topic.

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

19 replies on “Distant reading and the blurry edges of genre.”

Fascinating. What I read you as saying is that your statistical model generally manages to identify the texts that a human would confidently assign to a genre (albeit by using what appear to be quite different criteria), and that where it fails, this is in many cases because it is not possible to decide where genres begin and end on a conceptual level. So you begin with what appears to be an empirical question, but end up with a problem in the philosophy of literature.

This rather reminds me of how I first got excited about and then became frustrated by the potential of prototypes theory as an approach to genre. It makes it very tempting for the researcher to say, ‘Well of course the boundaries are fuzzy – they always are – it’s the unarguability of the easy cases that matters.’ But what you’re emphasising here is that problems of the boundaries between categories can’t be brushed aside like that.

Which in turn reminds me of a lot that goes on in social science research methods training, and in arguments between qualitative and quantitative social science researchers – both of whom deal with these problems constantly, although in different ways. Having to count things often brings up hard questions about what those ‘things’ are.

Hi Daniel. I think this is a great summary of what’s puzzling about genre.

If this were a longer more formal piece, one of the things I’d probably add is some hesitation about the word “genre” itself, because I suspect it covers a bunch of different entities with different degrees of coherence. Really broad genres like “fiction” and “drama” are stable enough that there is in practice a lot of social consensus at the center of the frame, though the edges are always blurry. As you get down into subgenres, the degree of blurriness increases so much that I think it’s almost a different kind of entity.

That variation makes the tension you’re discussing even trickier to describe, I think, because the things we might call “errors” if categories were clear become “ambiguities” when they’re not. With the (fairly broad, fairly stable) genres I’m discussing here, I would definitely want to admit that my classifiers do make simple errors. There are plenty of cases where human readers agree, and the classifier is just off. But there are also cases where the classifier’s uncertainty parallels human dissensus and, as you say, conceptual ambiguity. Disentangling those two phenomena is tricky.

p.s. – we’re working through some of the same issues at 6dfb in thinking about how to tag and categorize nodes and edges in a historical social network.

Let me try this again:
Great post, Ted – I look forward to the page-level map! Wanted to ask a question about methodology of classification here, and how it relates to longstanding scholarly practices of genre identification.
We talk about hybrid genres and the genre system quite a lot in the early modern period. Spenser’s Faerie Queene, for example, is often described as an epic romance, meaning that it participates in the genres of epic and romance without belonging exclusively to either of them. Shakespeare’s The Winter’s Tale is likewise a romance and a tragicomedy. But making these judgments – judging that a single work falls under multiple generic determinations – doesn’t usually produce the kinds of “grey areas” or “blurriness” that you mention in your post. Early modern scholars (or for that matter, early modern genre theorists themselves) wouldn’t say that the FQ is a blurry epic or *kind of* a romance, much less that it is some percentage epic and some percentage romance. Rather, they (we) would say that it participates in some, but not all, of the conventions of multiple genres. FQ is a European epic, for example, insofar as its knights engage in battle and prepare for the war of Christendom against the Saracens. It is a romance insofar as they are led astray by their love of distressed and often deceptive ladies. It is a romance because they are on horses, an epic because their goal is kleos, glory, as it was for Achilles. You get the picture. The generic division of The Winter’s Tale is clearer still: it participates primarily in the conventions of tragedy up to the stage direction “Exit pursued by a bear,” and after that participates primarily in the conventions of comedy.
The point of going on like this is that in none of these judgments, which allow for the overlap of multiple genres, is there anything like a gray area or blurriness: there is only further specification, further parsing out of elements of conventional generic belonging. So whence the blurriness in your models? To what extent is blurriness (however precisely or imprecisely measured) a product of quantitative modelling rather than of identification within the genre system itself? To what extent is it the product of needing to assign a work to “a genre” instead of allowing works to participate in multiple genres, as we readily do in practice?
Fiction and non-fiction are clearly harder cases, since we now take them to be opposed and exclusive rather than complementary categories. But instead of showing that there were blurry cases in the 19th-c, couldn’t it instead suggest that for the 19th-c they weren’t exclusive categories? Couldn’t a 19th-c work participate in both fiction and ethnography at the same time, in a way that would make us uncomfortable now, but that was no more problematic for its writers and early readers than hybrids like tragicomedy and epic romance were for 16th or 17th-c readers? Even in a recent case like that of James Frey, we may get upset, or we may stop by saying “this is a gray area.” But others get down to the non-blurry (if always fallible) work of sorting: of showing what elements of his memoir belong to fiction, what to non-fiction.

Sure. Hybrid genres make a lot of sense to me. In fact, in the post I call Shinkah an “ethnographic novel,” which is pretty parallel to “epic romance.”

If I were just describing Shinkah I could stop there, but I also want to construct an 80,000-volume corpus that I’ll use to investigate long-nineteenth-century fiction. What’s blurry, properly speaking, is not the generic status of any particular volume, but the boundary of that corpus — in the sense that it could be drawn in lots of different ways.

So I would say you’re right that the concept of blurriness comes — maybe not from quantification as such — but certainly from scale.

Assigning multiple labels to a single work makes enormous sense, and is certainly something I plan to do when I get to the stage of working with the kinds of “genres” that usually interest literary critics — things like mystery and science fiction. You could do it with Shinkah too if you wanted to; there’s no reason it couldn’t be treated both as fiction and as sociological nonfiction.

I am not effin’ surprised by this one bit. And, yes, we need to think about hybrid genres. This is a problem I’ve thought about a lot, though without the data you’ve got available to you. And I’ve thought about it in relation to the biological concept of the species.

There’s a reason why big chunks of the biological world – most especially multicellular animals – fall into neat species categories and why other big chunks of the biological world – lots of single-celled critters – do not. It has to do with inheritance. Among multi-celled animals almost all genetic transfer is vertical, from parents to child. But in the single-celled world there’s lots of horizontal transfer; it’s not all done through cell-division (which is a vertical inheritance process).

Well, the writers of texts do not always discipline themselves to follow one and only one set of models. They pick and choose promiscuously. And so we get hybrids all over the place.

Culture is, if anything, messier than biology, and biology’s plenty messy. Whatever’s going on, Ted, it’s not an artefact of your procedure that’s going to disappear when someone finds the One True Way. Ain’t no such thing.

So, what do we want of classification and genre theory, and why?

Here’s an old paper of mine that covers this in some detail, though it’s more about culture in general: Culture as an Evolutionary Arena.

This is very interesting! In Chinese studies we have been quite lucky in that the Chinese Rare Books project organized cataloging librarians in a systematic way that has placed a good amount of generic information into MARC records (at least for texts dating before 1800), which makes macro-generic analysis much easier. The librarians have generally stuck to traditional categories (mostly taken from earlier catalogs that used the “Complete Library of the Four Treasuries” system, which was how bibliographers for the imperial government classified texts), so it is at least very useful for analyzing how bibliographers conceived of genre. There are still a lot of difficulties with, as you say, page-level genre. There is much switching that occurs, particularly in compendia style works that gather works of multiple types together. It is also still difficult to analyze genres that fall outside traditional categories (of which there are many). My instinct tells me that this type of library-wide analysis may be fruitful in Chinese, because the style of language in many pre-modern Chinese texts is VERY dependent on genre. For the moment this remains speculation though, as not enough works have been digitized yet. I look forward to the page-level map!

Very interesting. MARC records are still a little mysterious to me (I often find myself humming the “Ark of the Covenant” theme from Raiders when I work with them).

I know that the MARC records in this period, in HathiTrust, are very unhelpful with genre, but I suspect the story might change if I compared records from multiple libraries or (especially) if I moved forward to a different period. It’s interesting that your area of study has also been able to systematize cataloging practices.

Haha, MARC records can be pretty impenetrable. Have you seen Deborah Byrne’s “MARC Manual: Understanding and Using MARC Records”? It is already 16 years old but is still very useful.

Reblogged this on DigIn' the Humanities! and commented:
This is a really fascinating discussion of the kinds of insight distant reading practices can bring to problematic literary boundaries such as the fuzzy concept of “genre.” I have been puzzling myself over how to model the idea of “literary influence.”

I’m so sorry I have waited so long to reply. Thank you for your link to your blog post; it was very helpful. I am still thinking about models of literary influence, esp. in 19th century novels. I am exploring semantic analysis, mytheme structures, and affect analysis currently.

Ted, fantastic essay. And let me also say at the outset: congratulations, and thank you, for your heroic work in making shape and form out of the Great Hathi Unread. This is a huge service to the whole DH community and we appreciate that.

I really enjoyed your articulation of two kinds of exclusiveness in digital corpora: digital archives exclude known instances of a research-constructed concept (Allen’s stat that they contain only 58% of novelistic production from 1800-36); but conversely, bibliographies exclude the contours of the very concept they construct, thereby defining “novelistic production” by the act of excluding other types of literary production, even those types that lie just outside the boundary. This kind of paradox vaguely reminds me of Agamben’s “homo sacer” — someone included in the law by his exclusion from it (since the law both no longer applies to him and allows anyone to kill him). In effect, our concepts of “literary production” include these border cases by the act of excluding them, and through that exclusion coming to define the types of production (novelistic, poetic, “non-fiction”) we actually study.

I agree that this is problematic, and we should be thinking about these border cases. Not least because I think the “so what?” answer that I’ve heard you make for DH — that it forces us to interrogate our concepts and discover how messy they actually are — seems to reveal itself best in these kinds of problems. Lately I’ve also been thinking that the quantitative sensibility of DH can profitably influence how we think about critical concepts like “literature,” periods or genres. Quantitative thinking allows us to rediscover how deeply embedded types of literary production are within print production more broadly — as you say, islands on a sea.

Ultimately I do think both approaches to corpus construction are necessary. In fact, as far as I understand how your model works, “border cases” are defined as exactly that — a mixed result between known types of cases. Without these known cases and border-case-excluding historiographies, we would be unable to give any meaningful structure at all to print production, relying on ultimately unsatisfying textual dimensions like “the texts that use past tense kind of a lot” as our only ways to differentiate the archive. As with many DH aporia, the choice of where to begin really depends on the project and questions at hand. But your work has allowed us to see the stakes of these choices, and to better understand how they partly create the object they study.


Thanks, Ryan — glad you liked it! The corpus-building work done out at the Stanford Lit Lab has of course been a model for me from the beginning. I agree that different approaches to corpus construction are complementary. There’s not going to be any one right way to draw boundaries, but we have to start by drawing them somewhere. Then we can change them, and see what changes.

I very much hope you’re right that this sort of exploration may end up reshaping our thinking about concepts like period and genre.
