Categories
deep learning, reproducibility and replication, social effects of machine learning

We can save what matters about writing—at a price

It’s beginning to sink in that generative AI is going to force professors to change their writing assignments this fall. Corey Robin’s recent blog post is a model of candor on the topic. A few months ago, he expected it would be hard for students to answer his assignments using AI. (At least, it would require so much work that students would effectively have to learn everything he wanted to teach.) Then he asked his 15-year-old daughter to red-team his assignments. “[M]y daughter started refining her inputs, putting in more parameters and prompts. The essays got better, more specific, more pointed.”

Perhaps not every 15-year-old would get the same result. But still. Robin is planning to go with in-class exams “until a better option comes along.” It’s a good short-term solution.

In this post, I’d like to reflect on the “better options” we may need over the long term, if we want students to do more thinking than can fit into one exam period.

If you want an immediate pragmatic fix, there is good advice out there already about adjusting writing assignments. Institutions have not been asleep at the wheel. My own university has posted a practical guide, and the Modern Language Association and Conference on College Composition and Communication have (to their credit) quickly drafted a working paper on the topic that avoids panic and makes a number of wise suggestions. A recurring theme in many of these documents is “the value of process-focused instruction” (“Working Paper,” 10).

Why focus on process? A cynical way to think about it is that documenting the writing process makes it harder for students to cheat. There are lots of polished 5-page essays out there to imitate, but fewer templates that trace the evolution of an idea from an initial insight, through second thoughts, to a dialectical final draft.

Making it harder to cheat is not a bad idea. But the MLA-CCCC task force doesn’t dwell on this cynical angle. Instead they suggest that we should foreground “process knowledge” and “metacognition” because those things were always the point of writing instruction. This is much the same thesis Corey Robin explores at the end of his post when he compares writing to psychotherapy: “Only on the couch have I been led to externalize myself, to throw my thoughts and feelings onto a screen and to look at them, to see them as something other, coldly and from a distance, the way I do when I write.”

Midjourney: “a hand writing with a quill reflected in a mirror, by MC Escher, in the style of meta-representation –ar 3:2 –weird 50”

Robin’s spin on this insight is elegiac: in losing take-home essays, we might lose an opportunity to teach self-critique. The task force spins it more optimistically, suggesting that we can find ways to preserve metacognition and even ways to use LLMs (large language models) to help students think about the writing process.

I prefer their optimistic spin. But of course, one can imagine an even-more-elegiac riposte to the task force report. “Won’t AI eventually find ways to simulate critical metacognition itself, writing the (fake) process reflection along with the final essay?”

Yes, that could happen. So this is where we reach the slightly edgier spin I feel we need to put on “teach the process” — which is that, over the long run, we can only save what matters about writing if we’re willing to learn something ourselves. It isn’t a good long-term strategy for us to approach these questions with the attitude that we (professors) have a fixed repository of wisdom — and the only thing AI should ever force us to discuss is, how to convey that wisdom effectively to students. If we take that approach, then yes, the game is over as soon as a model learns what we know. It will become possible to “cheat” by simulating learning.

But if the goal of education is actually to learn new things — and we’re learning those things along with our students — then simulating the process is not something to fear. Consider assignments that take the form of an experiment, for instance. Experiments can be faked. But you don’t get very far doing so, because fake experiments don’t replicate. If a simulated experiment does reliably replicate in the real world, we don’t call that “cheating” — but “in-silico research that taught us something new.”

If humanists and social scientists can find cognitive processes analogous to experiment — processes where a well-documented simulation of learning is the same thing as learning — we will be in the enviable position Robin originally thought he occupied: students who can simulate the process of doing an assignment will effectively have completed the assignment.

I don’t think most take-home essays actually occupy that safe position yet, because in reality our assignments often ask students to reinvent a wheel, or rehearse a debate that has already been worked through by some earlier generation. A number of valid (if perhaps conflicting) answers to our question are already on record. The verb “rehearse” may sound dismissive, but I don’t mean this dismissively. It can have real value to walk in the shoes of past generations. Sometimes ontogeny does need to recapitulate phylogeny, and we should keep asking students to do that, occasionally — even if they have to do it with pencil on paper.

But we will also need to devise new kinds of questions for advanced students—questions that are hard to answer even with AI assistance, because no one knows what the answer is yet. One approach is to ask students to gather and interpret fresh evidence by doing ethnography, interviewing people, digging into archival boxes, organizing corpora for text analysis, etc. These are assignments of a more demanding kind than we have typically handed undergrads, but that’s the point. Some things are actually easier now, and colleges may have to stretch students further in order to challenge them.

“Gathering fresh evidence” puts the emphasis on empirical data, and effectively preserves the take-home essay by turning it into an experiment. What about other parts of humanistic education: interpretive reflection, theory, critique, normative debate? I think all of those matter too. I can’t say yet how we’ll preserve them. It’s not the sort of problem one person could solve. But I am willing to venture that the meta-answer is, we’ll preserve these aspects of education by learning from the challenge and adapting these assignments so they can’t be fulfilled merely by rehearsing received ideas. Maybe, for instance, language models can help writers reflect explicitly on the wheels they’re reinventing, and recognize that their normative argument requires another twist before it will genuinely break new ground. If so, that’s not just a patch for writing assignments — but an advance for our whole intellectual project.

I understand that this is an annoying thesis. If you strip away the gentle framing, I’m saying that we professors will have to change the way we think in order to respond to generative AI. That’s a presumptuous thing to say about disciplines that have been around for hundreds of years, pursuing aims that remained relatively constant while new technologies came and went.

However, that annoying thesis is what I believe. Machine learning is not just another technology, and patching pedagogy is not going to be a sufficient response. (As Marc Watkins has recently noted, patching pedagogy with surveillance is a cure worse than the disease.) This time we can only save what matters about our disciplines if we’re willing to learn something in the process. The best I can do to make that claim less irritating is to add that I think we’re up for the challenge. I don’t feel like a voice crying in the wilderness on this. I see a lot of recent signs — from the admirable work of the MLA and CCCC to books like The Ends of Knowledge (eds. Scarborough and Rudy) — that professors are thinking creatively about a wide range of recent challenges, and are capable of responding in ways that are at once critical and self-critical. Learning is our job. We’ve got this.

References

Center for Innovation in Teaching and Learning, UIUC. “Artificial Intelligence Implications in Teaching and Learning.” Champaign, IL, 2023.

MLA-CCCC Joint Task Force on Writing and AI. “MLA-CCCC Joint Task Force on Writing and AI Working Paper: Overview of the Issues, Statement of Principles, and Recommendations.” July 2023.

Robin, Corey. “How ChatGPT Changed My Plans for the Fall.” July 30, 2023.

Rudy, Seth, and Rachel Scarborough King, eds. The Ends of Knowledge: Outcomes and Endpoints across the Arts and Sciences. London: Bloomsbury, 2023.

Watkins, Marc. “Will 2024 look like 1984?” July 31, 2023.

Categories
reproducibility and replication

New methods need a new kind of conversation

Over the last decade, the (small) fraction of articles in the humanities that use numbers has slowly grown. This is happening partly because computational methods are becoming flexible enough to represent a wider range of humanistic evidence. We can model concepts and social practices, for instance, instead of just counting people and things.

That’s exciting, but flexibility also makes arguments complex and hard to review. Journal editors in the humanities may not have a long list of reviewers who can evaluate statistical models. So while quantitative articles certainly encounter some resistance, they don’t always get the kind of detailed resistance they need. I thought it might be useful to stir up conversation on this topic with a few suggestions, aimed less at the DH community than at the broader community of editors and reviewers in the humanities. I’ll start with proposals where I think there’s consensus, and get more opinionated as I go along.

1. Ask to see code and data.

Getting an informed reviewer is a great first step. But to be honest, there’s not a lot of consensus yet about many methodological questions in the humanities. What we need is not strict gatekeeping so much as transparent debate.

As computational methods spread in the sciences, scientists have realized that it’s impossible to discuss this work fruitfully if you can’t see how the work was done. Journals like Cultural Analytics reflect this emerging consensus with policies that require authors to share code and data. But mainstream humanities journals don’t usually have a policy in place yet.

Three or four years ago, confusion on this topic was understandable. But in 2018, journals that accept quantitative evidence at all need a policy that requires authors to share code and data when they submit an article for review, and to make it public when the article is published.

I don’t think the details of that policy matter deeply. There are lots of different ways to archive code and data; they are all okay. Special cases and quibbles can be accommodated. For instance, texts covered by copyright (or other forms of IP) need not be shared in their original form. Derived data can be shared instead; that’s usually fine. (Ideally one might also share the code used to derive it.)

2. … especially code.

Humanists are usually skeptical enough about the data underpinning an argument, because decades of debate about canons have trained us to pose questions about the works an author chooses to discuss.

But we haven’t been trained to pose questions about the magnitude of a pattern, or the degree of uncertainty surrounding it. These aspects of a mathematical argument often deserve more discussion than an author initially provides, and to discuss them, we’re going to need to see the code.

I don’t think we should expect code to be polished, or to run easily on any machine. Writing an article doesn’t commit the author to produce an elegant software tool. (In fact, to be blunt, “it’s okay for academic software to suck.”) The author just needs to document what they did, and the best way to do that is to share the code and data they actually used, warts and all.

3. Reproducibility is great, but replication is the real point.

Ideally, the code and data supporting an article should permit a reader to reproduce all the stages of analysis the author(s) originally performed. When this is true, we say the research is “reproducible.”

But there are often rough spots in reproducibility. Stochastic processes may not run exactly the same way each time, for instance.

At this point, people who study reproducibility professionally will crowd forward and offer an eleven-point plan for addressing all rough spots. (“You just set the random number seed so it’s predictable …”)
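To be fair to the eleven-point plan, the seed-setting fix really is a one-liner in most environments. A minimal Python sketch (the sample sizes and distributions here are arbitrary, chosen just to illustrate the point):

```python
import random

import numpy as np

# Fixing the seeds makes a stochastic analysis repeat exactly.
random.seed(42)
np.random.seed(42)

# Any downstream sampling now yields the same result on every run.
sample_a = np.random.normal(loc=0.0, scale=1.0, size=5)

# Re-seeding and re-drawing reproduces the draw bit for bit.
np.random.seed(42)
sample_b = np.random.normal(loc=0.0, scale=1.0, size=5)

assert np.array_equal(sample_a, sample_b)
```

So exact reproducibility of a stochastic pipeline is usually cheap to provide, and worth providing; the question raised below is whether it is the thing that actually matters.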

That’s wonderful, if we really want to polish a system that allows a reader to push a button and get the same result as the original researcher, to the seventh decimal place. But in the humanities, we’re not always at the “polishing” stage of inquiry yet. Often, our question is more like “could this conceivably work? and if so, would it matter?”

In short, I think we shouldn’t let the imperative to share code foster a premature perfectionism. Our ultimate goal is not to prove that you get exactly the same result as the author if you use exactly the same assumptions and the same books. It’s to decide whether the experiment is revealing anything meaningful about the human past. And to decide that, we probably want to repeat the author’s question using different assumptions and a different sample of books.

When we do that, we are not reproducing the argument but replicating it. (See Language Log for a fuller discussion of the difference.) Replication is the real prize in most cases; that’s how knowledge advances. So the point of sharing code and data is often less to stabilize the results of your own work to the seventh decimal place, and more to guide investigators who may want to undertake parallel inquiries. (For instance, Jonathan Goodwin borrowed some of my code to pose a parallel question about Darko Suvin’s model of science fiction.)

I admit this is personal opinion. But I stress replication over reproducibility because it has some implications for the spirit of the whole endeavor. Since people often imagine that quantitative problems have a right answer, we may initially imagine that the point of sharing code and data is simply to catch mistakes.

In my view the point is rather to permit a (mathematical) conversation about the interpretation of the human past. I hope authors and readers will understand themselves as delayed collaborators, working together to explore different options. What if we did X differently? What if we tried a different sample of books? Usually neither sample is wrong, and neither is right. The point is to understand how much different interpretive assumptions do or don’t change our conclusions. In a sense no single article can answer that question “correctly”; it’s a question that has to be solved collectively, by returning to questions and adjusting the way we frame them. The real point of code-sharing is to permit that kind of delayed collaboration.
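The difference between reproducing and replicating can be made concrete with a toy sketch. Everything here is invented for illustration: `estimate_effect` stands in for some real analysis, and the “features” are simulated numbers, not real textual evidence. The point is only the shape of the comparison — two investigators, two different samples, one shared question.

```python
import numpy as np


def estimate_effect(n_books: int, rng: np.random.Generator) -> float:
    """Toy stand-in for a real analysis: estimate the correlation
    between two simulated textual features across a sample of books."""
    feature_x = rng.normal(size=n_books)
    # Build in a genuine underlying relationship, plus noise.
    feature_y = 0.5 * feature_x + rng.normal(scale=1.0, size=n_books)
    return float(np.corrcoef(feature_x, feature_y)[0, 1])


# Two "delayed collaborators" draw different samples of books
# (different seeds stand in for different corpora).
original_study = estimate_effect(200, np.random.default_rng(1))
replication = estimate_effect(200, np.random.default_rng(2))

# Reproducibility would ask: same seed, same number, to the seventh
# decimal place. Replication asks the looser, more important question:
# does a different sample support broadly the same conclusion?
print(original_study, replication)
```

Neither number is “the right answer”; what matters is whether the estimates point the same way, and how much they move when the sample changes. That is the conversation code-sharing makes possible.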