LLMs Transform the Way Students Prove Software Correctness

The classroom is being quietly reshaped by a new kind of tutor: a large language model that chats, reasons, and writes code with a fluency that sometimes feels too fast to be real. In courses that teach formal methods, where students prove that programs do exactly what they are supposed to do, this AI tutor is both a promise and a puzzle. The promise is straightforward: if a student can ask the right questions, the model might illuminate corners of logic that are hard to see on one's own. The puzzle is deeper: does this help learning, or does it just let the machine do the heavy lifting while the student skims along the surface of correctness?

A recent study tackles that question head on. A team of researchers led by Carolina Carreira of Carnegie Mellon University, working with collaborators at INESC TEC and the University of Porto in Portugal and at IST in Lisbon, set up a careful experiment to see how students interact with a chat-based AI while solving deductive verification tasks in Dafny, a language that lets programmers specify what their code should do and automatically verify that it does. The study isn't about replacing human understanding with AI; it's about whether AI can become a tool that amplifies a student's own reasoning, and if so, how to design that collaboration so it teaches rather than substitutes. The results are nuanced, practical, and surprisingly optimistic in places, with clear warnings about when to rely on a machine and when to resist its siren call.

The Experiment Reimagined

The researchers recruited fourteen master's students enrolled in a formal methods course and asked them to tackle two verification problems, named Queue and Tree. Each participant completed both problems, but for one of the two they had access to a custom chat interface that logged every prompt and every reply from a large language model, while for the other they worked unaided. It is a classic crossover design with a modern twist: you're allowed to bring a thinking partner, but you still have to do the thinking yourself.

Two Dafny problems were chosen to reflect realistic, moderately challenging tasks in which students must implement methods that meet formal contracts and invariants. One subproblem asked the student to implement a method given its preconditions and postconditions, and the other asked the student to take a natural language goal and produce the specification and implementation. The base code for the tasks included a circular buffer implementing a queue and a binary search tree, complete with contracts written in Dafny. Students used the official Dafny extension in Visual Studio Code and ran the Dafny verifier locally, keeping the environment stable and comparable across participants.
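To give a flavor of that style, here is a minimal, hypothetical Dafny contract in the same spirit; it is a sketch for illustration, not code from the study's materials. The requires and ensures clauses are the specification, and the verifier checks the body against them automatically.

    // Hypothetical, deliberately tiny example of the contract-first style:
    // dequeueing the front of a non-empty queue modeled as a sequence.
    method DequeueFront(q: seq<int>) returns (front: int, rest: seq<int>)
      requires |q| > 0              // precondition: the queue must be non-empty
      ensures front == q[0]         // postcondition: return the oldest element
      ensures rest == q[1..]        // postcondition: the rest keep their order
    {
      front := q[0];
      rest := q[1..];
    }

The study's actual base code, a circular-buffer queue and a binary search tree, carries much richer contracts and invariants, but the workflow is the same: write or satisfy the contract, run the verifier, and respond to what it rejects.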

Behind the scenes, the team built a controlled ChatGPT platform that captured 206 messages in total: 103 student prompts and 103 model responses. The model was a version of GPT-4 with standard settings. The study randomized whether each participant had the AI available on the first or the second problem, and ensured that all students solved both problems under otherwise identical conditions. The goal was not to test a single magic prompt but to understand how people actually work with AI on a demanding theoretical task, and what patterns correlate with better outcomes.

All of this took place in a Portuguese university setting—the Formal Methods for Critical Systems course at the University of Porto—yet the insights feel universal: when you give students a consistent, well-behaved AI partner and, crucially, teach them how to use it, the collaboration can unlock performance that would be hard to reach otherwise. The study’s authors emphasize that this is not a miracle cure; it’s a tool whose value hinges on the student’s own reasoning and strategic use of prompts. And they don’t shy away from the caveats. AI can surface key ideas and help scaffold a solution, but it can also derail a learner if the prompts veer into guesswork or the student stops thinking for themselves.

What It Teaches Us About Learning with AI in Formal Methods

When the data came in, the numbers told a striking story. On average, students scored 17.39 out of 20 with AI assistance, versus 9.36 without. Better still, every participant improved when the AI helped, and the improvement was statistically significant. The difference wasn't just a few extra points; in several cases, the AI turned a failure into a pass. The implication is tempting: with AI, students can reach levels of correctness they might struggle to achieve on their own in the time allotted.

But the researchers didn't stop at the aggregate score. They dug into the substructure of the tasks and the kinds of work students did with the AI. The data showed that the biggest gains tended to come on implementation tasks, where you translate a specification into executable code that the verifier can check, rather than on the specification-writing and lemma-based parts of the work. In the study, the average score on the implementation side with AI was notably higher than on the specification side, hinting at a practical sweet spot for AI assistance: it helps when you're turning a design into a working piece of verified code, less so when you're wrestling with the abstract proof obligations themselves.
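To make that sweet spot concrete, consider another small, hypothetical Dafny method (again, not one of the study's tasks). The ensures clause is the specification; the implementation passes the verifier only because the loop carries invariants strong enough for Dafny to discharge that postcondition. Supplying or repairing exactly this kind of implementation-side detail is where an AI suggestion can unblock a student.

    // Hypothetical illustration: the loop invariants are what let Dafny prove
    // the postcondition; remove them and the verifier rejects the method.
    method Fill(a: array<int>, v: int)
      modifies a
      ensures forall k :: 0 <= k < a.Length ==> a[k] == v
    {
      var i := 0;
      while i < a.Length
        invariant 0 <= i <= a.Length
        invariant forall k :: 0 <= k < i ==> a[k] == v
      {
        a[i] := v;
        i := i + 1;
      }
    }

Deciding what the ensures clause should say in the first place is the specification side of the work, and that is the part the study found students still had to reason through themselves.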

One revealing detail: a single student, whom the researchers highlight as a striking case, went from a near-zero unaided score to a near-perfect AI-assisted score, yet still reported very low confidence in their own ability. That tension—strong objective performance paired with shaky subjective trust—speaks to a deeper theme in AI-assisted learning: the machine can do the work, but the learner’s own sense of understanding and ownership may lag behind. The authors point out that this disconnect matters, because education is as much about shaping a durable, transferable understanding as it is about producing a correct answer on a single problem set.

From a methodological angle, the study also reveals that the quality of prompts matters a lot. The researchers cataloged how students interacted with the AI—what kinds of prompts they used, how they revised prompts after errors, and whether they supplied full context such as the whole Dafny class definition. They found that the best performers tended to include comprehensive context early, then steer the AI to focus on one subproblem at a time. In contrast, some students fell into traps: overloading the AI with extraneous lemmas and ghost variables, or relying on the AI to generate answers without applying their own reasoning. The practical takeaway is crisp: with AI in the loop, prompt design becomes a critical skill, not an optional add-on.

With that in mind, the study offers a nuanced view of the trust students place in AI. About half of the participants said they trusted the AI’s output, while the other half remained skeptical. Reasons for trust ranged from the AI producing correct code to the possibility of verifying the AI’s suggestions using the very verification tool being studied. Reasons for distrust included occasional syntax errors in the AI’s suggestions, overconfidence when wrong, or hallucinated features that simply do not exist in the Dafny world. The lesson here is not that AI is unreliable, but that human learners must cultivate a critical skepticism: you should be able to verify what the AI gives you, not assume it’s correct by default.

Lessons for the Classroom and Beyond

The paper doesn't pretend to settle how formal methods should be taught in a world of AI tutors; it offers a pragmatic cookbook for educators who want AI to support learning, not substitute for it. The authors distill three actionable recommendations that feel achievable in a semester-long course.

First, design prompts that foreground the learner's own context and break problems into manageable pieces. In practice, that means encouraging students to include the full class structure and to focus on a single subproblem at a time, rather than dumping entire assignments into the AI and hoping for a quick fix. Second, teach productive prompt engineering as part of the curriculum. If students are going to rely on an AI partner, they should also learn how to refine prompts when the AI stalls, how to steer the model away from unnecessary complexity, and how to extract useful feedback rather than chase a perfect result in a single attempt. Third, embrace the reality that not all tasks are equally AI-friendly. The authors deliberately designed LLM-resistant challenges that force students to engage with the specification and invariants themselves, not just the implementation. They found that while AI can be a powerful ally for implementation tasks, those centered on writing clear contracts and invariants may be less amenable to offloading to a model, at least with the current generation of tools.

Beyond the classroom, the findings resonate in industry, where formal methods already play an important role in high-assurance software. If teams bring AI into the verification workflow, the work suggests they should be thoughtful about where the AI sits in the chain. AI can help surface ideas, propose initial implementations, or draft proofs, but human reviewers must still guide the reasoning, check edge cases, and insist on a robust, explainable understanding of why a particular approach is correct. In other words, AI should be a collaborator that clarifies and accelerates thinking, not a shortcut that replaces it.

The study is an encouraging sign that large language models can meaningfully augment formal methods education if used intentionally. It shows that the right kind of AI partner can lift students over some of the steepest cognitive hurdles, such as turning abstract specifications into verifiable code, while also revealing the limits of such assistance. The authors observe that the gains come with a cautionary note: LLMs are not a panacea, and over-reliance can blunt the deeper understanding that formal methods seek to cultivate. The path forward, they argue, lies in clear design of problems, explicit instruction on how to work with AI, and careful attention to how students build trust in the tools they use.

In the end, the study’s core message lands with a gentle but firm clarity: can large language models help students prove software correctness? Yes, but only if we teach students how to work with them as seasoned collaborators. The AI can do a lot of the heavy lifting, but it is the student who must steer, critique, and ultimately own the proof. When that happens, the result is not just a score on a test, but a deeper, transferable confidence in reasoning about software that could shape how we teach, learn, and verify for years to come.

Note: The study was conducted by researchers from Carnegie Mellon University and universities in Portugal, including the University of Porto and IST Lisbon, with lead author Carolina Carreira guiding the project. The work examined how students interact with a custom ChatGPT interface while solving Dafny verification tasks, and it highlights concrete strategies educators can use to integrate AI into formal methods courses without eroding learning.