Why it’s hard for syuzhet to be right or wrong yet.

I’ve enjoyed following the exchange between Matt Jockers, Annie Swafford, Jacob Eisenstein, and Dan Piepenbring about Jockers’ R package syuzhet — designed to illuminate plot by tracing the “emotional valence” of narration across the course of a novel.

I’ve found this a consistently impressive and informative conversation; it has taught me literally everything I know about “low-pass filters.” But I have no idea who is right or wrong.

More fundamentally, I’m unsure how anyone could be right or wrong here, because as far as I can tell there’s no thesis under discussion yet. Jockers’ article isn’t published. All we have is an R package, syuzhet, which does something I would call exploratory data analysis. And it’s hard to evaluate exploratory data analysis in the absence of a specific argument.

For instance, does syuzhet smooth plot arcs appropriately? I don’t know. Without a specific thesis we’re trying to test, how would we decide what scale of variation matters? In some novels it might be a scene-to-scene rhythm; in others it might be a long arc. Until I know what scale of variation matters for a particular question, I have no way of knowing what kind of smoothing is “too much” or “too little.”*

The same thing goes, more fundamentally, for the concepts of “plot” and “emotional valence” themselves. As Jacob Eisenstein has pointed out, these aren’t concepts that have a single agreed-upon meaning. To argue about them meaningfully, we’re going to need a particular historical or formal question we’re trying to solve.

It seems to me likely that syuzhet will usefully illuminate some aspects of plot. But I have no way of knowing which aspects until I look at a test involving groups of books that readers perceive as different in some specific way. For instance, if syuzhet reliably discriminates between books with tragic and comic endings, that would already be interesting. It’s not everything we mean by plot, but it’s one important thing.

The underlying issue here is that Matt hasn’t published his article yet. So we don’t actually have a thesis to debate. What we have is a new form of exploratory data analysis, released as an R package. Conversation about exploration can be interesting; it can teach me a lot about low-pass filters; but I don’t know how it could be wrong or right until I know what the exploration is trying to reveal.

I think this holds even for Matt’s claim that he’s identified six (or seven) fundamental plot patterns. That sounds like a thesis, but I would tend to say it’s still description of exploratory analysis — in this case a clustering process. Matt has done the clustering in a principled and careful way, but clustering is still (in my eyes) basically an exploratory method. I’m not sure how to evaluate it until I know what kind of generic or historical evidence would count as confirmation that we’re looking at a coherent “plot pattern.”

There are a range of ways to get that confirmation. Lynn Cherny has explored plot using supervised methods; if you do that, predictive accuracy gives you an easy test. But unsupervised methods can also be great, in cases where tests aren’t so easy to define; it’s just that an unsupervised method needs to be supplemented by historical or formal discussion that tells you what would count as confirmation for this method. I imagine there will be some of that in Matt’s article, when it comes out.

* [Edit March 31: After playing around with some artificial data myself, I have to acknowledge that the low-pass filter option in syuzhet can behave in unintuitive ways where extreme outliers and edges are involved. I think Annie Swafford (in blog posts) and Daniel Lepage (below) have been right to emphasize this. It could be less of an issue with real data; I had to use pretty extreme outliers to “break” the filter; it’s not actually the case that the whole shape is necessarily defined by its single highest point. But my guess is that this sort of filter would only add value if you wanted to build in a strong prior that plot fluctuates on or near a particular “wavelength.” On the other hand, Matt Jockers has alluded to unpublished evidence for that sort of prior (or at least for a particular filter setting). So, after changing my opinion a couple times, I’m still not feeling I have an answer here.]
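For anyone who wants to replicate that experiment, the sketch below shows roughly the kind of artificial data I mean. It approximates an ideal low-pass filter by simple Fourier truncation in base R (it is not syuzhet’s own implementation), and the outlier is deliberately extreme.

```r
# Minimal sketch of the artificial-data experiment described above.
# Base R only; the filter is a generic Fourier truncation, not
# syuzhet's own code.

low_pass <- function(x, keep = 3) {
  n  <- length(x)
  ft <- fft(x)
  mask <- rep(0, n)
  mask[1:keep] <- 1                  # keep the lowest `keep` components...
  mask[(n - keep + 2):n] <- 1        # ...and their mirrored conjugates
  Re(fft(ft * mask, inverse = TRUE)) / n
}

set.seed(1)
n <- 300
calm <- sin(seq(0, 2 * pi, length.out = n)) + rnorm(n, sd = 0.3)

spiked <- calm
spiked[280] <- 100                   # one wildly extreme "sentence" near the end

plot(low_pass(calm), type = "l", ylim = c(-3, 3),
     xlab = "narrative time", ylab = "filtered value")
lines(low_pass(spiked), lty = 2)     # dashed: the same novel plus one outlier
```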

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

20 replies on “Why it’s hard for syuzhet to be right or wrong yet.”

Thanks, Ted, for a thoughtful response to the conversation. There are a lot of knobs here, so a lot to *explore*, and it’s easy to get caught up in tuning parameters. The main reason I stuck the code on GitHub was that I hoped others would experiment with the knobs, run some real books through the tool, and see whether the resulting graphs were a fair representation of the shape of the “story” (in Vonnegut’s sense). A few have done this and reported their results, which is awesome.

The conversation we’ve been having over the best filter value, and about whether the sentiment analysis is really good enough, is certainly important, but it reminds me a bit of the debates over how many topics to choose and where to set the hyperparameter in LDA. These are important questions, to be sure, and I’ve built many a topic model trying to find that ideal k, but ultimately the question for me is whether a reasonable person would agree that topic 5 is about “seafaring” and that “seafaring” really is an important part of Moby Dick.

Regarding Syuzhet, at least for the time being, I’m interested in something similar, which is to say, whether the approximated shape is a fair representation of the *relative* emotional highs and lows, and then I’m interested in what parameters were used to get that fair approximation. Maybe low pass is not the best option. Maybe we need to use a Gaussian filter to get the best shape (something Annie Swafford suggested to me in an email)? Maybe a moving average would be better (as suggested by Daniel Lepage)? These are great ideas, and while we toss them into the transient winds of Twitter, the code just sits there waiting to be forked.

The analogy to topic modeling makes sense to me. Both methods are hard to validate, because they’re more or less unsupervised. In both cases, it’s tempting to debate their validity on a priori grounds, but that doesn’t get us very far. I would say they’re both basically exploratory methods, and the question of validity becomes a question of what patterns people are, or aren’t, able to turn up with them (and then validate in a separate process).

But that point keeps getting lost with topic modeling as well! The reason people keep debating parameter settings and so on is that humanists have a tendency to want to interpret LDA as an “analytic tool” rather than an exploratory method. I.e., we really want it to be oracular; we may listen patiently while people explain that it’s not, but then we want to know that we’ve got “the right” parameter settings so that we can go ahead and treat it as oracular.

I’m starting to feel that unsupervised methods need to carry a kind of FDA warning label. “This machine generates beautiful visualizations that are known to the State of California not to be validated yet by any out-of-sample predictive test. Use at your own risk.”

In any case, thanks for sharing syuzhet! Sharing code is a lot of work, and it’s very generous of you. If I had an extra two weeks, I would definitely be running a lot of books through it right now.

This has been such a great conversation to follow — thanks to Annie and Matt and the others. I especially appreciate your emphasis here, Ted, on exploratory as opposed to analytic methods. Analysis in the humanities, as we all know and have discussed much, involves a particular, situated, embodied person with a perspective. Using these features or those, changing this parameter setting or that one, reading these results (rather than those) with this visualization (rather than that one) are all choices that correspond to the perspective of the person making those tweaks and the perspective (if you want to call it that) that the machine (hardware, platform, software, algorithm, interface) has been calibrated to foreground. It’s mediated. Of course, this isn’t to say that exploratory methods aren’t also directed in many senses. In my mind, I have the image of a gal in the field with a magnifying glass. What she will discover is contingent as much on the daisies at her feet as on the ailing mother she left in the house behind her, her tired feet, the crack in the glass. I agree that the constructed, interpretive, and performative aspect of this work is what keeps us engaged and debating. Thanks to you all for transparency!

Thanks, Tanya. I appreciate your endorsement of exploratory methods. What you say is persuasive, and I think it’s a great description of the way unsupervised methods like topic modeling work. People do basically tweak parameters to get results that make sense for their particular approach. Which is fine.

But it won’t have escaped you that I’m actually a little lukewarm about unsupervised methods. What I actually love and trust (from my own subjective, embodied perspective!) are predictive models that can be tested. I agree they’re still constructed and interpretive and etc. … but … I also know when they’ve stopped working. And I like that reassurance.

I appreciate your appreciation! I also agree that the tweaking in topic modeling and machine learning is part of the art of the work and completely appropriate. At the same time, I don’t think that the subjective aspect of this is simply lip service to some humanist overlord of situationality (in fact, that entity sounds frightening and I would hate to meet her). In my own work, I have found that testing predictive models still brings me back to whether or not my “ground truth” was actually true. In other words, one can argue that a predictive model is not working because the results don’t seem to be accurate (a sense of accuracy which is measured against what is “right” or “wrong” in the first place). If what you are trying to measure, however, is subjective from the beginning (such as, in my case, what the nature of a sound might be), then it’s much more difficult to tell whether or not your model is working well and, it seems to me, it starts to feel as if the whole process is a little arbitrary and … constructed and interpretive and etc. …

Fair enough. There are definitely cases where “ground truth” becomes slippery. And now I kind of want to meet the Humanist Overlord of Situationality.

While it’s true that there haven’t been any peer-reviewed publications about Syuzhet, I would think that the theses of the various blog posts themselves are worth discussing. The claim that Syuzhet (with its default parameters) allows us to conclude that there are only 6 or 7 basic plot shapes, in particular, is certainly appearing as a headline in multiple news articles, so if this isn’t actually a thesis then someone really needs to clarify that quickly before more people get the wrong idea.

Also, you raised the question “does Syuzhet smooth plot arcs appropriately?”, and we actually can answer that question without looking at any results at all: it doesn’t, because Syuzhet doesn’t smooth data. It approximates the original signal with smooth curves, which is a very different process, and which isn’t an appropriate tool for Syuzhet’s stated use case of identifying “the simple shape of stories”.

Ideal low-pass filters are meant to be used on signals that are very long compared to their cutoff frequency. In audio engineering, for example, a 100 Hz filter is a pretty low filter to use, but would nonetheless have a period of 1/1000th of the width of a ten-second audio file. The low-pass filter that produced the data for the clustering, on the other hand, has a period of half of the total width of the signal, which is virtually guaranteed to grossly distort the data – the width of the ringing artifacts is inversely proportional to the highest frequency that the filter lets through, which means in this case we should expect the ringing to affect most of the signal. At best, we’d expect that Syuzhet might correctly identify the most extreme point in the original signal, but the rest of the foundation shape would largely be determined by the ringing around that point.
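A quick way to see the relationship between cutoff and ringing width for yourself (using a generic Fourier truncation in base R rather than Syuzhet’s exact code):

```r
# The lower the cutoff, the further an abrupt change ripples into its
# neighbours.  Generic Fourier truncation in base R; illustration only.

low_pass <- function(x, keep) {
  n  <- length(x)
  ft <- fft(x)
  mask <- rep(0, n)
  mask[1:keep] <- 1
  mask[(n - keep + 2):n] <- 1
  Re(fft(ft * mask, inverse = TRUE)) / n
}

bump <- c(rep(0, 100), rep(1, 100), rep(0, 100))   # one sustained emotional high

plot(bump, type = "l", col = "grey", ylim = c(-0.3, 1.3),
     xlab = "narrative time", ylab = "value")
lines(low_pass(bump, keep = 3),  lwd = 2)   # very low cutoff: ripples spread across the whole signal
lines(low_pass(bump, keep = 30), lty = 2)   # higher cutoff: ringing stays near the two transitions
```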

In this sense, the analogy to LDA’s parameters isn’t quite right – it’s not about setting a parameter, it’s about choosing a method. Using a low-pass filter to reduce 5500 measurements down to four numbers is perhaps comparable to doing topic modeling by simply applying PCA to a bunch of documents and using the components as topics – you’re using a powerful tool, and the end result looks kind of like a topic model, but it’s completely the wrong tool for the job and you shouldn’t expect to learn much about the latent topics in your documents from it. You don’t even need to run it to know that LDA would produce better results.

We can see this pretty clearly in all the examples in Annie Swafford’s blog post. It’s also pretty clear in Matt Jockers’s most recent post – all eight figures have obvious discrepancies between the foundation shape and the original signal, despite his protestations to the contrary. Figures 3 & 4 are particularly telling – changing the last third of the book actually inverts the foundation shape of the first two thirds. Obviously they can’t both be good representations of the “latent shape” of the first two-thirds, because they say opposite things. Given our knowledge of low-pass filters, the explanation is pretty obvious – that final third, which now has the strongest consistent signal, completely reshapes the rest of the foundation shape through its ringing artifacts. (This has strong implications for the clustering as well: here are two fake novels that are literally identical for the first two thirds, but Syuzhet’s foundation shapes make it look like they are completely different everywhere.)
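A rough way to reproduce that kind of experiment (again a generic Fourier truncation in base R, not Syuzhet’s own code): two artificial “novels” that share an identical first two-thirds end up with different shapes everywhere once the ending changes.

```r
# Two fake "novels" with identical first two-thirds but different endings.
# Generic Fourier truncation in base R; illustration only.

low_pass <- function(x, keep = 3) {
  n  <- length(x)
  ft <- fft(x)
  mask <- rep(0, n)
  mask[1:keep] <- 1
  mask[(n - keep + 2):n] <- 1
  Re(fft(ft * mask, inverse = TRUE)) / n
}

shared  <- sin(seq(0, 3 * pi, length.out = 200))   # identical first two-thirds
novel_a <- c(shared, rep( 1, 100))                 # happy ending
novel_b <- c(shared, rep(-1, 100))                 # bleak ending

plot(low_pass(novel_a), type = "l", ylim = c(-1.5, 1.5),
     xlab = "narrative time", ylab = "truncated shape")
lines(low_pass(novel_b), lty = 2)
# The two curves already diverge over the first 200 "sentences",
# where the underlying novels are word-for-word identical.
```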

When doing exploratory analysis, this sort of discovery has to affect the way you explore – once you know that your tool isn’t doing what you intended, further exploration is a waste of your time, because anything else you discover (like the 6 or 7 fundamental plot shapes) is suspect: maybe it’s an insight into your data, but maybe it’s just a result of the limitations of your tool.

Thanks, Daniel. I appreciate your contribution, and Swafford’s posts, but I respectfully disagree. What’s actually going on here is that syuzhet is trying to infer a simpler curve from complex evidence, using parameters set by the user. We could argue about whether that’s “smoothing” (it’s certainly less like moving-average smoothing than it is like, say, loess smoothing). But that may be a semantic question. I’m happy to agree with you about the underlying point that syuzhet’s filtering is an inferential process rather than a simply descriptive one. I would also be happy to agree that it seems to entail a strong prior assumption that plot trajectories approximate a sine wave. When people say syuzhet “distorts” curves, they basically seem to mean it forces everything to be sinusoidal, which is true — but also an explicit prior. (The question underlying the debate you & Swafford have been having with Jockers may be whether it’s an appropriate prior — and I don’t claim to know the answer.)

Where I disagree is with the notion that we can know what sort of exploratory data analysis is appropriate for fiction or for plot — before we know what specific historical or formal claim we’re trying to test — just by looking at graphs.

Your analogy to LDA/PCA seems to me fair enough, but I would draw a different conclusion from it. I don’t actually know, in the abstract, whether LDA or PCA (or neither) will help me answer questions about plot, until I have a specific question to answer. But I think both of those methods could conceivably be appropriate for exploration — and I think exploratory data analysis is what syuzhet is actually doing. I don’t want to see certain forms of exploration ruled out of court in advance; it’s not clear to me what we collectively gain from that.

That’s a good point, and I’d agree with you if the narrative around Syuzhet were simply that it’s a tool for turning every novel into a simple curve. But from the very beginning, Syuzhet has been pitched as a tool specifically for finding “plot shapes” in the sense meant by Vonnegut in his master’s thesis, and that’s the claim I was specifically addressing above. That is a testable claim, and it’s clear from virtually every published Syuzhet graph that it doesn’t find plot shapes in Vonnegut’s sense.

In the LDA/PCA analogy, it’s absolutely true that you might discover something interesting using pure PCA. But to claim that the resulting components were “topics” in the topic modeling sense would simply be factually wrong. This doesn’t mean they’re not interesting, it just means you can’t call them what they’re not.

This is not to say that Syuzhet isn’t a good tool for exploratory data analysis. But as with any exploratory tool, the goal is ultimately to learn something about our data, and so as soon as we do find something (such as “there are only 6 or 7 fundamental plot shapes” or “The plot of Portrait of the Artist follows this curve”) we have to put the exploration on hold and validate our findings. Otherwise we can’t really know if our exploration succeeded – did we just discover something new about literature, or did we just discover a limitation of our tools?

In this particular case, I am convinced that the “6 or 7 fundamental plot shapes” conclusion is an artifact of the resampling process – when you’re reducing every book to a mere two Fourier terms[1], it shouldn’t be surprising that they cluster well. So while the exploratory analysis did discover something, the resulting claim (that this is an interesting observation about novels) is wrong – it’s just an interesting discovery about 2nd-order sinusoidal approximations.

As a side note, I’m afraid I also have to disagree with your claim that ‘We could argue about whether that’s “smoothing” ‘ – “smoothing” has a specific meaning in mathematics, and Syuzhet’s algorithm is simply not smoothing. This isn’t an interpretive issue or a matter of opinion, it’s just a matter of mathematical definitions.

[1] There are only two terms because Syuzhet recenters every foundation shape around 0 (on the grounds that only the shape matters). This means the first term of the Fourier series has no effect; since the low-pass filter used for the clustering removes everything above the third, only two terms have any effect. As a side note, this also means that Syuzhet is reducing every novel to only 4 numbers, but then upsampling back to 100 terms before clustering; this is also concerning, because it (provably) can only make the clustering results worse.
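Sketching that arithmetic in base R, purely as an illustration of the footnote rather than Syuzhet’s actual pipeline:

```r
# Once a shape is recentred around zero, its first (DC) Fourier term
# vanishes, so a "keep the lowest three components" filter really keeps
# two sinusoids -- four real numbers per novel -- before any upsampling
# back to 100 points.  Illustration only.

set.seed(2)
raw     <- cumsum(rnorm(500))     # stand-in for a sentence-level sentiment trajectory
centred <- raw - mean(raw)        # recentre around 0, as described above

ft <- fft(centred)
round(Mod(ft[1]), 10)             # ~0: the first term carries no information

ft[2:3]                           # the two surviving components: two complex numbers,
                                  # i.e. four real parameters in all
```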

Thanks again, Daniel. I completely agree on the substantive point here, which is that literary scholars have no reason yet to conclude that there are only “six fundamental plot shapes.”

In my view, the most important reasons not to reach that conclusion are

a) literary scholars haven’t agreed yet on a definition of “plot” — certainly not one that reduces it to two dimensions! — and

b) even if we had a consensus definition of “plot,” it wouldn’t be a good idea to reach conclusions about the number of fundamental plot patterns based purely on clustering. Most clustering methods start by assuming that there are discrete clusters to find — which might or might not be true here. And even if we wanted to provisionally make that assumption, we would definitely want to follow up by asking whether the “clusters” we found corresponded to some recognizable historical or generic patterns before basing a literary-historical claim on clustering. (For all I know, of course, Matt may do this sort of checking in the article; I haven’t seen the article, so I’m just saying “we have no reason yet to believe this claim.”)
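A toy example of the point in (b), sketched in base R: k-means will happily carve completely structureless data into exactly as many “clusters” as you ask it to find.

```r
# k-means applied to uniform random noise still returns k tidy clusters.
# Illustration only; the features and counts are arbitrary.

set.seed(3)
no_structure <- matrix(runif(200 * 4), nrow = 200)   # 200 "books", 4 arbitrary features

fit <- kmeans(no_structure, centers = 6, nstart = 25)
table(fit$cluster)   # six groups, none of which reflect real structure in the data
```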

So we agree on the substantive literary-historical point here. I think where we diverge is that you’d like to resist the “6 or 7 plot shapes” conclusion on strictly quantitative grounds: the resampling algorithm in syuzhet seems to you inherently inappropriate. Now, we don’t diverge here because I want to argue that syuzhet has the right algorithm (filtering curves so they become sinusoidal certainly imposes a strong prior, and your remark about the interaction of filtering and clustering also makes good sense to me — simplifying the curves is very likely to reduce the number of clusters).

Rather, we diverge because I really want to resist a premise that seems to have governed both sides of the argument so far* — the notion that these questions can be decided a priori at all.

Switching out one filtering algorithm for another would not change my conviction that clustering, by itself, is an insufficient basis for literary-historical conclusions. Clustering, topic modeling, PCA, and so on are all guaranteed to find some kind of order within a particular simplified representation. But that guarantee is a problem. I would say we don’t really have a historical argument until we make some effort to test these models — either quantitatively (out of sample), or simply by mapping the model onto ordinary literary evidence outside the representation we used to create the model.

That’s why I wrote the blog post. I’m very willing to defer to you (and Matt, and Annie) on the choice between filtering algorithms. What matters to me is the larger point that literary history isn’t resolvable by a priori arguments. I don’t even think we know, purely a priori, what kind of exploration is likely to work. I’ve seen situations where we said “this simplified representation is obviously too crude to be useful” — and turned out to be wrong — so I lean strongly toward trying things on a historical scale, validating our results out of sample, and only then arguing about priors.

* edit: Actually, on reflection, I think my remark that “both sides” have relied on a priori argument is a bit unfair to Matt, because he’s consistently said that these patterns need to be checked against our readings of particular novels.

I think the main point where we diverge is actually on the question of the purpose of Syuzhet. You’re reading it as a purely exploratory package – it computes some functions of books and compares them to see if anything can be learned by it. And were that the case, I’d agree that it can’t really be “right” or “wrong” – it’s just a tool, and anything it produces might tell us something interesting.

But Syuzhet isn’t purely exploratory: It’s exploration guided by specific goals. In the author’s own words, “Syuzhet was designed to estimate and smooth the emotional highs and lows of a narrative”, and so we can have some a priori intuition about whether it is the “right” or “wrong” tool for this specific task.

Consider, for example, a hypothetical software package called Teryat (from the Russian “Терять” meaning “to lose, waste, or shed”) that is identical to Syuzhet except that, when the time comes to produce 100-element foundation shapes from much longer books, it does so by taking the first 100 sentences of the novel and simply throwing out the rest of the data (hence the name).

It’s entirely possible that you could learn something interesting from the shapes produced by Teryat. But it’s also obvious that whatever you learned would only tell you about the first 100 sentences of your novels – it could not possibly capture or smooth the highs and lows of a narrative because it doesn’t even consider the majority of the narrative. This means that we can, a priori and on strictly quantitative grounds, conclude that Teryat isn’t the right tool for this job.

Syuzhet’s extreme low-pass filter is not quite as bad as this, but it’s still pretty bad, and for similar reasons. The only difference is that these reasons aren’t obvious without a deeper understanding of Fourier transforms and resampling theory.

With that understanding, we can, a priori, know that Syuzhet’s results aren’t going to tell us very much about Vonnegut’s “plot shapes”, for the same reasons that we could say the same thing about Teryat. It’s not that the low-pass filter is “inherently inappropriate”, but that it’s inherently inappropriate for the stated goals of Syuzhet.

As a side note, I agree that it’s unfair to characterize Matt Jockers’s blog posts as trying to resolve literary history via a priori arguments, because he is trying to compare his results to his own interpretations of the input plots. But I don’t think it’s fair to characterize Annie Swafford’s posts as doing this, either. Her posts focus on the same thing that I am focusing on in this response (namely, that Syuzhet doesn’t achieve its stated goals), which isn’t a literary or historical conclusion so much as an algorithmic one. And what’s more, while I truly am claiming this based solely on my a priori knowledge as a mathematician and programmer, she’s actually been providing concrete examples of the sorts of problems that I’m predicting here.

I love the analogy to “Teryat.” And I have to admit that taking the first 100 sentences of a book seems to me like a case where I would just say “nope — that’s not going to work — or, if it does, I’ll be amazed.” I’d be willing to venture a priori judgment there!

But suppose we change it a little and make it a package that estimates plot by taking the first sentence from each chapter, or by using outlier sentences from each chapter. That’s a closer analogy to syuzhet, I think. It still throws out a lot of data. But in a case like that, I’d be inclined to say “I don’t know — it could work, at least statistically and in the aggregate. I don’t really know whether it will or not until I try it on a few hundred books. Maybe outliers, or first sentences, are really important.”

Part of the reason I’m so unwilling to judge algorithms in advance is that I suspect we don’t yet have a good understanding of the goal of the algorithm. You believe syuzhet can be judged by its fitness for the goal of capturing “Vonnegut’s plot shapes.” I might agree, if I were confident that they actually existed.

But as a literary scholar, I’m painfully conscious that we have no consensus about “plot.” Vonnegut was just making stuff up, honestly, and there aren’t actually a huge number of critics who have agreed with him about the nature of plot (i.e., that it’s mainly a function of happiness / sadness). So I’m really not confident that “this package correctly captures Vonnegut’s plot shapes” is a testable hypothesis. Those durn things might not even exist. Or they might only exist in a weak form that requires aggressive generalization/resampling/smoothing to be visible as a pattern at all. Or we might need to poke around in an exploratory way, trying lots of different things, before we formulated a testable hypothesis about the real relationship between “emotional valence” and “plot.” That’s the key reason why I think syuzhet is only exploratory; I doubt it actually can be more than that.

However, I confess that if I were using syuzhet, I’d probably try it first with some version of moving-average smoothing rather than a low-pass filter. Annie Swafford’s blog posts convinced me that low-pass filters impose some very strong prior assumptions. I don’t know that those priors are wrong, but it’s a good exploratory heuristic to start with weak priors if you can.
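If I were sketching that first pass myself, it might be as simple as a rolling mean in base R (independent of syuzhet’s own options):

```r
# A centred moving average assumes much less about the shape of a plot
# than a sinusoidal low-pass filter does.  Illustration only.

set.seed(4)
sentiment <- cumsum(rnorm(500, sd = 0.5))   # stand-in for sentence-level sentiment scores

window   <- 51                              # width in sentences (odd, so the window is centred)
smoothed <- stats::filter(sentiment, rep(1 / window, window), sides = 2)

plot(sentiment, type = "l", col = "grey",
     xlab = "narrative time", ylab = "sentiment")
lines(smoothed, lwd = 2)                    # rolling mean overlaid on the raw series
```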

I realize this way of putting things may seem exaggeratedly cautious, but I’m convinced my field thinks it knows much more than it actually does, and getting people to say “we don’t know yet” is Priority #1 for me.