When AI Grades Essays, It’s Not Grading What You Think

Why trusting AI to grade essays feels like handing your novel to a stranger

In the age of AI, the idea that a machine could grade your university essay sounds like a dream come true. No more waiting weeks for feedback, no more human bias or fatigue. Instead, a super-smart algorithm reads your words, judges your arguments, and hands back a score. But a new study from Università Cattolica del Sacro Cuore in Milan, led by Andrea Gaggioli and colleagues, throws cold water on that fantasy. Their research reveals that even the most advanced Large Language Models (LLMs)—including GPT-4 and Claude 3.5—struggle to match human judgment when it comes to assessing complex student essays.

It’s not just about whether AI can read text; it’s about whether it can truly understand the nuances of academic writing, originality, and practical feasibility. The study’s findings suggest that while these models can produce consistent scores, they often miss the mark on what really matters in higher education assessment.

The allure and challenge of automated essay scoring

Essay grading is a notoriously thorny task. Unlike multiple-choice tests, essays demand interpretation, critical thinking, and an appreciation of context. Human graders bring their expertise, intuition, and sometimes biases to the table. But they also get tired, distracted, and inconsistent. Automated Essay Scoring (AES) systems have been around since the 1960s, evolving from simple word counts to sophisticated machine learning models. The latest generation, LLMs trained on vast swaths of text, promises to revolutionize this space by understanding language more deeply and flexibly.

Gaggioli’s team put five cutting-edge LLMs to the test: OpenAI’s GPT-4, Anthropic’s Claude 3.5, Google’s Gemini 2.5, Mistral 24B, and DeepSeek v2. They fed these models 67 Italian-language essays from a university psychology course, each about 2,500 to 3,000 words long. The essays weren’t just any writing—they were student proposals for psychosocial interventions, evaluated by human experts on four criteria: Pertinence (relevance), Coherence (logical flow), Originality, and Feasibility (practicality).
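The paper’s exact prompts aren’t reproduced here, but to make the setup concrete, here is a minimal sketch of what rubric-based LLM scoring typically looks like. It assumes the OpenAI Python client and an API key; the rubric wording, the 1-to-10 scale, and the JSON output format are illustrative placeholders, not the study’s actual materials.

```python
# Illustrative sketch only: rubric-based essay scoring via an OpenAI-style chat API.
# Assumes the OpenAI Python client (pip install openai) and OPENAI_API_KEY set.
# The rubric wording, 1-10 scale, and JSON format are placeholders, not the study's.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the essay from 1 (poor) to 10 (excellent) on each criterion:\n"
    "- Pertinence: relevance to the assigned psychosocial-intervention brief\n"
    "- Coherence: logical flow and structure of the argument\n"
    "- Originality: novelty of the proposed intervention\n"
    "- Feasibility: practicality under real-world constraints\n"
    'Reply with JSON only, e.g. {"pertinence": 7, "coherence": 8, "originality": 6, "feasibility": 5}.'
)

def score_essay(essay_text: str, model: str = "gpt-4") -> dict:
    """Ask one model for rubric scores on a single essay."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # damp run-to-run variation; still no guarantee of determinism
        messages=[
            {"role": "system", "content": "You are an expert grader of university psychology essays."},
            {"role": "user", "content": f"{RUBRIC}\n\nESSAY:\n{essay_text}"},
        ],
    )
    # A real pipeline would validate the reply; here we assume well-formed JSON.
    return json.loads(response.choices[0].message.content)
```

Repeating a call like this several times per essay and per model is what makes it possible to measure consistency, which is exactly where the study’s most striking finding appears.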

Consistent but disconnected: AI’s scoring paradox

One might expect that if AI models are consistent in their scoring, they’d at least agree with human graders. The study found otherwise. The LLMs were stable when scoring the same essay multiple times, yet their scores barely correlated with human evaluations; the correlations were so weak they did not reach statistical significance. In other words, a model that gives the same score every time can still be consistently wrong, or at least consistently out of step with human judgment.
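To see why consistency and agreement can come apart, it helps to look at how each is measured. The snippet below is a rough sketch, not the study’s analysis code: it uses made-up score arrays purely to show that repeated runs of a model can correlate almost perfectly with each other while correlating hardly at all with human graders.

```python
# Toy illustration of consistency vs. agreement. All scores are synthetic placeholders.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human_scores = rng.integers(4, 11, size=67)        # placeholder expert ratings, one per essay
model_bias = rng.integers(4, 11, size=67)          # the model's own implicit ranking, unrelated to humans
# Five repeated runs of the "same model": its own ranking plus a little noise.
model_runs = np.clip(model_bias + rng.integers(-1, 2, size=(5, 67)), 1, 10)

# Intra-model consistency: how well repeated runs agree with each other.
run_pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
run_consistency = np.mean([spearmanr(model_runs[i], model_runs[j])[0] for i, j in run_pairs])

# Agreement with humans: correlation of the model's average scores with expert ratings.
rho, p_value = spearmanr(model_runs.mean(axis=0), human_scores)

print(f"run-to-run consistency (Spearman): {run_consistency:.2f}")
print(f"agreement with human graders:      {rho:.2f} (p = {p_value:.2f})")
```

The first number comes out high and the second hovers near zero, which is the shape of the paradox the researchers describe: stability is not the same thing as validity.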

Interestingly, the models tended to inflate scores for coherence, giving essays a higher logical flow rating than human graders did. They also showed inconsistent handling of context-dependent criteria like pertinence and feasibility—areas that require understanding the specific academic discipline and real-world constraints. For example, an AI might rate an intervention proposal as highly feasible without grasping practical limitations that a human expert would spot immediately.

Different AIs, different rubrics

Another surprise was how differently the models scored the same essays. While some, like Claude 3.5 and Gemini 2.5, were generous and clustered scores near the maximum, others, like Mistral 24B, were more conservative and variable. This divergence suggests that each model operates with its own implicit rubric, shaped by its training data and architecture. So swapping one AI grader for another isn’t just a technical upgrade; it’s a fundamental change in what’s being measured.

Moreover, when the researchers looked at how the models agreed with each other, they found moderate consensus only on coherence and originality. For pertinence and feasibility, agreement was negligible. This pattern underscores that AI is better at assessing surface-level features like structure and novelty than deeper, context-rich judgments.
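For readers who want to picture what “agreement between models” means in practice, here is a schematic of the computation: correlate each pair of models’ scores criterion by criterion, then average. The scores dictionary is filled with random placeholders to show the mechanics only; it does not reproduce the study’s data or results.

```python
# Schematic of per-criterion inter-model agreement, using placeholder scores.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

models = ["gpt-4", "claude-3.5", "gemini-2.5", "mistral-24b", "deepseek-v2"]
criteria = ["pertinence", "coherence", "originality", "feasibility"]

rng = np.random.default_rng(1)
# One score per essay (67 essays) for each model and criterion.
scores = {m: {c: rng.integers(1, 11, size=67) for c in criteria} for m in models}

for criterion in criteria:
    pairwise = [spearmanr(scores[a][criterion], scores[b][criterion])[0]
                for a, b in combinations(models, 2)]
    print(f"{criterion:>12}: mean pairwise Spearman = {np.mean(pairwise):+.2f}")
```

In the study’s actual data, this kind of table would show moderate values for coherence and originality and values near zero for pertinence and feasibility.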

Why this matters beyond the classroom

The implications of these findings ripple far beyond grading psychology essays in Milan. As universities worldwide grapple with growing class sizes and stretched resources, automated grading tools are tempting solutions. But if AI can’t reliably replicate human judgment—especially on complex, interpretive tasks—then relying on it risks unfair or misleading evaluations.

It also raises questions about the nature of assessment itself. Essays are not just about correctness; they’re about demonstrating understanding, creativity, and practical insight. These qualities are deeply human and context-dependent. The study reminds us that AI, for all its linguistic prowess, still lacks the disciplinary insight and pedagogical sensitivity that expert educators bring.

Keeping humans in the loop

Gaggioli and colleagues emphasize that human oversight remains critical. Automated systems might serve as helpful assistants—flagging essays for review, providing preliminary scores, or offering feedback on surface features. But final judgments, especially on open-ended academic work, should rest with humans who can interpret nuance and context.

The study also points to the need for hybrid approaches that combine AI’s scalability with human expertise. Future research might explore how prompt design, exemplar conditioning, and rubric calibration can improve AI reliability. But for now, the message is clear: AI grading is not a magic bullet, and trusting it blindly could undermine educational fairness and quality.
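As one concrete example of what exemplar conditioning could mean here: show the model a previously graded essay together with its expert scores before asking it to grade a new one, so it has an anchor for the scale. The sketch below is purely hypothetical; the exemplar, the scores, and the prompt structure are invented for illustration and do not come from the paper.

```python
# Hypothetical sketch of exemplar conditioning for essay scoring: prepend a
# human-graded example to the rubric so the model can calibrate its scale.
# The exemplar text and scores are invented for illustration.
GRADED_EXEMPLAR = {
    "essay": "(excerpt from an essay already graded by the course's experts)",
    "scores": {"pertinence": 8, "coherence": 7, "originality": 5, "feasibility": 6},
}

def build_calibrated_prompt(rubric: str, essay_text: str) -> str:
    """Combine the rubric, one worked example with expert scores, and the new essay."""
    return (
        f"{rubric}\n\n"
        f"EXAMPLE ESSAY:\n{GRADED_EXEMPLAR['essay']}\n"
        f"EXPERT SCORES: {GRADED_EXEMPLAR['scores']}\n\n"
        f"Now score the following essay on the same scale:\n{essay_text}"
    )
```

Whether this kind of calibration would actually close the gap with human graders is exactly the open question the authors leave for future work.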

The road ahead

This research from Università Cattolica del Sacro Cuore, led by Andrea Gaggioli, is a timely reality check on the promises of AI in education. It invites educators, technologists, and policymakers to think critically about where and how AI fits into assessment. The allure of instant, automated grading is strong, but the complexity of human learning demands more than algorithms can currently deliver.

In a world increasingly enamored with AI, this study reminds us that some tasks—like understanding a student’s original thought and practical reasoning—still require the human touch. Until AI can grasp the full tapestry of context, creativity, and feasibility, it will remain a helpful tool, not a replacement for human judgment.