The wave of large language models has carried with it a lot of bravado about what machines can do—and how quickly they can do it. But in classrooms and law offices alike, a quieter question has been gathering steam: can artificial intelligence judge complex, open‑ended work the way humans do? A team from Maritaca AI in Campinas, Brazil, led by Ramon Pires, Roseval Malaquias Junior, and Rodrigo Nogueira, set out to test this idea against a real, high-stakes standard: the Brazilian Bar Examination (OAB). Their study isn’t just about whether an AI can imitate a lawyer’s answer. It asks whether an AI can reliably grade other people’s reasoning in a domain where nuance, structure, and argument matter as much as factual knowledge.
The researchers built something they call oab-bench, a benchmark drawn from three recent editions of the OAB written phase. It contains 105 questions spread across seven areas of law, paired with the same official evaluation guidelines that human examiners use. The aim isn’t to produce another flashy score for a single model, but to create a public, update‑friendly test bed where models can be evaluated on open‑ended legal writing—essays, documents, and reasoned answers—just as a human examiner would. And crucially, the project doesn’t stop at “Can the AI produce a good answer?” It also asks, “Can the AI judge a good answer—and do so in a way that mirrors human grading?”
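To make that setup concrete, here is a minimal sketch of how a benchmark item of this kind could be represented in code. It is an illustration only: the class names, fields, and point values are assumptions for the sake of example, not the schema of the released oab-bench data.

```python
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    """One scorable element from an official evaluation guideline (illustrative)."""
    description: str   # what the examiner is looking for
    max_points: float  # points awarded when the element is fully present


@dataclass
class ExamQuestion:
    """Hypothetical representation of an open-ended written-phase question."""
    exam_edition: str                # label for the exam edition (illustrative)
    area_of_law: str                 # one of the seven areas, e.g. "Civil Law"
    prompt: str                      # the question or case statement given to the candidate
    rubric: list[RubricItem] = field(default_factory=list)

    def max_score(self) -> float:
        """Total points available, as a human examiner would tally them."""
        return sum(item.max_points for item in self.rubric)


# Invented example content, for illustration only:
question = ExamQuestion(
    exam_edition="hypothetical edition",
    area_of_law="Civil Law",
    prompt="Draft a reasoned answer to the client's question about contract rescission.",
    rubric=[
        RubricItem("Identifies the applicable legal basis", 0.5),
        RubricItem("Applies it correctly to the facts of the case", 0.5),
    ],
)
print(question.max_score())  # 1.0
```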
In the spirit of scientific curiosity, the authors push beyond simply testing models as students. They also cast frontier LLMs in the role of judge and compare their grading with that of human examiners. Using an examiner-style prompt framework, they enlist a strong judge model to score model responses and human responses according to the actual scoring rubrics from the exam’s guidelines. The results are striking enough to make you pause: a top model, Claude 3.5 Sonnet, not only achieved the highest average across the board, but its scoring as a judge also closely tracked human grading in many cases. The work is a reminder that behind every clever answer lies a more delicate question: can the machine’s evaluation of an answer be trusted as much as a trained human’s?
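As a rough illustration of the judging loop described here, the sketch below assembles a prompt from the question, its official guideline, and a candidate answer, asks a judge model for per-item scores, and sums them. It reuses the hypothetical ExamQuestion sketch above; the prompt wording and the call_judge_model stand-in are assumptions, not the authors’ actual pipeline.

```python
import json


def build_judge_prompt(question: "ExamQuestion", answer: str) -> str:
    """Assemble an examiner-style prompt from the question, its rubric, and a candidate answer."""
    rubric_text = "\n".join(
        f"- ({item.max_points} pts) {item.description}" for item in question.rubric
    )
    return (
        "You are grading a Brazilian Bar Exam written answer.\n\n"
        f"Question:\n{question.prompt}\n\n"
        f"Official scoring guideline:\n{rubric_text}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        "Return JSON of the form {\"scores\": [...]} with one score per guideline item."
    )


def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to an LLM API; swap in a real client here."""
    raise NotImplementedError


def grade(question: "ExamQuestion", answer: str) -> float:
    """Score one answer by summing the per-item scores returned by the judge model."""
    raw = call_judge_model(build_judge_prompt(question, answer))
    scores = json.loads(raw)["scores"]              # e.g. [0.5, 0.25]
    return min(sum(scores), question.max_score())   # never exceed the rubric total
```

The same loop can grade a model-written answer or a transcribed human answer, which is what makes a direct comparison between machine and human scoring possible.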
What’s more, the study anchors its findings in a concrete institutional setting. The OAB exam is organized by Fundação Getúlio Vargas (FGV) and is the gatekeeper to practicing law in Brazil. By tying oab-bench to real exam content and the official guidelines, the team grounds its work in a live ecosystem where legitimacy matters. The project makes a point of transparency: the data, the benchmark, the prompts, and the evaluation pipeline are publicly available, inviting others to test, challenge, and improve the approach. It’s a rare scientific moment where high‑stakes professional practice meets open science in a tangible way.
So what does this mean beyond the cleverness of “AI as judge”? It hints at a future where automated evaluation helps teachers and boards scale nuanced feedback, where students could get faster, more consistent assessments, and where the guardians of professional standards might use AI to triage and calibrate human graders with greater fairness. None of this would replace human experts overnight, but it signals a path toward more consistent, scalable, and evidence‑driven evaluation in domains where judging the quality of reasoning is the real challenge.