Does ChatGPT hate split infinitives as much as I do?
[Image: A student sitting at a desk, pencil in hand, looking at a desktop computer. The screen shows a paper marked with red and green checks and a smiley face, suggesting the computer has graded the paper.]

I consider myself very fortunate to have gone to a high school with dedicated, skillful teachers who put in enormous amounts of time honing their students' writing. To this day, I remember their cheeky advice: "Ending a sentence with a preposition is like stubbing your toe!" and "Don't start a sentence with 'because', because it reads like a fragment!" Now, as someone who must provide feedback on students' writing (likely way too much feedback, if you ask my students), I know just how much time it takes (i.e., A LOT). Wouldn't it be wonderful if there were automated ways to provide initial feedback on writing quality? Could large language models like ChatGPT do that work for teachers? Steiss et al. (2024) conducted a study to find out.

Now, I've been pretty critical of claims that generative AI can do anything and everything, but that doesn't mean we shouldn't investigate what it can do, and how well. That's what Steiss and colleagues did, and what they found was pretty fascinating. For a previous study, they had asked sixteen experienced educators, writing researchers, and graduate students majoring in literacy education to provide feedback on 200 essays from 6th to 12th graders. These raters were trained and given rubrics before providing feedback. On average, it took raters 20–25 minutes to provide feedback for an essay.

Then, they asked ChatGPT (version 3.5, not the most recent version 4.0) to provide feedback on those same essays. They tested different prompts for ChatGPT before finding that this one produced the best feedback:

“Pretend you are a secondary school teacher. Provide 2–3 pieces of specific, actionable feedback on each of the following essays … that highlight what the student has done well and what they could improve on. Use a friendly and encouraging tone. If needed, provide examples of how the student could improve the essay” (Steiss et al., 2024, p. 4)

Of note, they didn't provide ChatGPT with a rubric or the scoring scale. Then, they had researchers rate the quality of both human and ChatGPT feedback (raters did not know which was which, but apparently it was easy to tell) across four qualities: how well the feedback referenced the criteria for source-based argumentative writing, how clear the directions for improvement were, how much the feedback prioritized the essential features of the essay assignment, and how supportive the feedback was.

They found that humans' feedback, on average, was of higher quality than ChatGPT's, but not by very much: "Although there were some small to moderate differences between human feedback and ChatGPT, the ChatGPT feedback was still of relatively high quality" (p. 7). There was some evidence that ChatGPT didn't do as well with high-quality essays, but there were no statistically detectable differences in feedback quality between essays written by initially fluent English speakers vs. English learners. And ChatGPT graded those papers fast, requiring no incentive to maintain volition, like eating M&Ms while grading (as some people do, I mean, I've heard that, not that I, well, sometimes...anyway).

So, what's it all mean? The authors summed it up nicely:

"Even if ChatGPT’s feedback is not perfect, it can still facilitate writing instruction by engaging and motivating students and assisting teachers with managing large classes, thus providing them more time for individual feedback or differentiated writing instruction (Grimes & Warschauer, 2010). Given our results, we see a plausible use case for generative AI: providing feedback in the early phases of writing, where students seek immediate feedback on rough drafts. This would precede, not replace, teacher-provided formative or summative evaluation that is often more accurate and more tailored to student-specific characteristics, albeit less timely" (p. 13).

I still have my doubts about the long-term viability of generative AI (e.g., AI-generated content polluting training models, leading to model collapse) but I do have to give ChatGPT its props here. If it could help students produce a better first draft for teachers to evaluate, well, that does sound like a valuable contribution. Because grading time has turned out to really be something I am worried about. (Oof, that sentence! ChatGPT - see if you can correct that abomination!)