When AI learns to translate math into formal proofs

What autoformalization is and why it could matter

Mathematics is a language with two dialects. The everyday speech we use to describe a problem is not the same as the precise, machine-checkable language used by proof assistants such as Lean, Coq, or Isabelle. Autoformalization is the project of teaching a computer to translate a natural-language math problem into a formal statement that a proof assistant can verify. It sounds like ordinary translation between two human languages, but the stakes are different: a misspelled symbol or an ambiguous variable can turn a valid argument into a failed proof, or worse, into a formal statement that checks out but no longer says what the original problem meant.
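To make the target concrete, here is a minimal sketch of what such a translation looks like, using Lean 4 with the Mathlib library. The problem and the theorem name are our own illustration, not an example from the paper, and the proof body is deliberately left as a placeholder, since autoformalization concerns the statement rather than the proof.

```lean
import Mathlib

-- Informal problem: "Show that the sum of two even integers is even."
-- A faithful Lean 4 formalization states exactly this claim; proving it is a
-- separate task for a prover, so the proof is left as a placeholder.
theorem sum_of_two_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry
```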

Recent advances in large language models have fueled optimism that we can automate this translation. The new paper from StepFun and collaborators pushes this forward by asking not just for faithful translations but for translations whose correctness a computer can verify through rigorous checking, with no handholding required. The authors argue that a successful autoformalizer needs two things: a comprehensive grasp of how formal languages express mathematical ideas and a robust ability to map everyday problem statements onto those formal objects. Without both, a system can misidentify the objects involved or misinterpret the problem structure, leading to errors that a human would catch but a machine might not.

Behind the work are researchers from the State Key Laboratory of Processors at the Institute of Computing Technology, Chinese Academy of Sciences, with contributions from the University of Chinese Academy of Sciences, the University of Science and Technology of China, and StepFun Inc. The project was led by Yutong Wu, with a team of coauthors spread across these institutions. Their aim is not just to build a better translator, but to teach a model to reason about how to translate, in a way that aligns informal intuition with formal precision. If successful, the impact could ripple through automated theorem proving, formal verification of software, and even education, by making formal reasoning more accessible to people who think and work in ordinary language first.

Highlights: The paper argues for two core abilities in an autoformalization model. First, a deep, near-encyclopedic mastery of the formal language used for proofs. Second, the ability to reason through informal problems and lay down a clear path from everyday description to a formal statement that a proof assistant can check. The authors propose a data-driven pipeline called ThinkingF to cultivate both abilities at once.

ThinkingF: a recipe for fusing formal knowledge with informal reasoning

Think of ThinkingF as a two‑engine mixer for AI math brains. One engine builds up formal knowledge, the other teaches the model how to bridge informal language and formal objects. The authors identify two bottlenecks that have held back autoformalization in the past. Without formal knowledge, a model might fail to spot the right Lean definitions or mislabel a mathematical object. Without solid informal-to-formal reasoning, it might misinterpret what the problem is really asking or skip steps that would be obvious to a human preparing a formal statement.

To address these, ThinkingF weaves together two data streams and two training stages. The first stream distills a vast corpus of formal knowledge from specialized systems into a large pool of informal-formal pairs that a generalist LLM can learn from. The second stream builds reasoning trajectories that show how to go from an informal problem to a formal Lean statement, including the typical ways of decomposing the problem and the mapping of everyday mathematical concepts to Lean constructs. The authors emphasize that this second stream is template-guided, not just copied from a general-purpose reasoning model. The idea is to avoid off-task reasoning that does not naturally translate into formal steps, a problem they observed in prior work when naive distillation let a general model's internal reasoning drift far away from the formal task at hand.

Once these data streams exist, they feed a two-stage fine-tuning process. First, a strong generalist model is fine-tuned on both the formal knowledge data and the reasoning trajectories. Then a reinforcement learning phase uses a verifiable reward signal: whether the model's formal statement can be proven equivalent to a trusted ground truth using Lean 4's proof machinery. This BEq-style equivalence check, while computationally heavy, provides a crisp, objective target for the model to optimize. In effect, ThinkingF teaches the model not only to translate but to calibrate its translations against formal proof reality.
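To see what that reward actually checks, here is a schematic sketch in Lean 4 with Mathlib of the shape of a BEq-style verdict. The statements and names are ours, and the two equivalence proofs are left as placeholders, since in the real pipeline they are searched for automatically with a restricted set of tactics.

```lean
import Mathlib

-- Ground-truth formalization: if n squared is even, then n is even.
def groundTruth : Prop := ∀ n : ℤ, Even (n ^ 2) → Even n

-- A candidate produced by a model, phrased through divisibility by 2 instead of `Even`.
def candidate : Prop := ∀ n : ℤ, 2 ∣ n ^ 2 → 2 ∣ n

-- The BEq verdict is positive only if each statement provably implies the other.
theorem beq_forward : groundTruth → candidate := by
  sorry

theorem beq_backward : candidate → groundTruth := by
  sorry
```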

The two pillars that make the approach credible

The first pillar is domain knowledge in the target formal language. The authors construct a formal knowledge dataset by running a specialized autoformalizer on large public math problem sets and then filtering with a few quality gates. The result is a few hundred thousand high-quality informal-formal pairs. The second pillar is reasoning data that captures how an informal problem can be decomposed and reassembled into a Lean-friendly formalization. Here the team uses templates that guide the model through problem understanding, concept analysis, and the mapping to Lean 4 objects, followed by a final formal code block. This deliberate, template-guided approach helps prevent the kind of misalignments that plague more unstructured attempts at translation.

Crucially, the authors do not stop at simply generating translations. They embed a reasoning trace inside the model outputs in a structured format, and train the model to produce both the formal statement and its reasoning steps. This dual output mirrors how human mathematicians think: we outline the plan before writing the formal proof. The philosophy is to cultivate a model that can show its work in formal language as clearly as it can spell out the informal problem in natural language.
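As a stylized illustration of that dual output, the sketch below writes the reasoning trace as comments and ends with the formal statement. The trace format and the example problem are ours; the paper's actual template may structure the steps differently.

```lean
import Mathlib

-- Informal problem: "Prove that for every real number x, x^2 + 1 is positive."
-- Step 1 (understand): one real variable x, one strict inequality to state.
-- Step 2 (map concepts): real numbers become `ℝ`, "x squared" becomes `x ^ 2`,
--   and "is positive" becomes `0 < x ^ 2 + 1`, universally quantified over x.
-- Step 3 (emit): write the formal statement; the proof is left to a downstream prover.
theorem sq_add_one_pos : ∀ x : ℝ, 0 < x ^ 2 + 1 := by
  sorry
```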

What the benchmarks reveal about capability and limits

The authors test their approach on three well-known formal mathematics benchmarks: FormalMATH-Lite, ProverBench, and CombiBench. They measure BEq, bidirectional extended definitional equivalence, which asks whether the model's formal statement is semantically equivalent to a ground-truth statement, as judged by a formal proof in Lean. BEq@1 is a one-shot verdict, while BEq@16 allows up to 16 attempts to produce an equivalent statement. This setup gives a nuanced view of both single-turn accuracy and the model's potential when given opportunities to revise or retry.
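In compact form, under the usual reading of the "@k" convention (the notation here is ours): with N benchmark problems, ground-truth statement s_i for problem i, and k sampled formalizations per problem,

$$
\mathrm{BEq@}k \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\,\exists\, j \le k:\ \mathrm{BEq}\big(\hat{s}_{i,j},\, s_i\big)\,\right],
$$

so BEq@1 scores a single sample per problem, while BEq@16 credits a problem if any of 16 samples is verified equivalent.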

The results push the state of the art. The larger 32B StepFun-Formalizer achieves BEq@1 scores of 40.5 percent on FormalMATH-Lite and 26.7 percent on ProverBench, outperforming both broad generalist models and prior specialized autoformalizers. A smaller 7B variant still beats many competitors on both benchmarks, offering a compelling balance of performance and compute. These gains are not trivial; they indicate the model is now capable of producing formal statements that are not just syntactically correct but semantically aligned with human ground truths when checked by an automatic prover.

One of the clearest messages from the ablation studies is that the reasoning data, the template-guided informal-to-formal trajectories, plays a crucial role. Removing the reasoning dataset causes a sharp drop in BEq@16, i.e., the model's upper bound, while removing the formal knowledge data also hurts, though to a lesser extent. In other words, the ability to reason about how to map informal statements into formal steps is the engine that drives the model toward verifiable formalizations, and the domain knowledge provides the scaffolding the engine needs to operate correctly.

Beyond the numbers: what this changes about building math AI

If this approach scales, it could alter how we build tools that assist in formal mathematics and software verification. A reliable autoformalizer can feed into automated theorem provers to generate testable formalizations from human mathematical ideas, then have those formalizations checked and refined by the prover itself. That could dramatically lower the barrier to using formal methods in research, education, and industry. Imagine students drafting a math problem in plain language, and a system that translates it into Lean statements with a traceable reasoning path, which a proof assistant then verifies step by step. The cycle from intuition to verified result would be closer to a single workflow than a patchwork of disconnected tools.

Of course the route is not without caveats. The authors document off-task behavior in some general-purpose reasoning models, where the model focuses its internal reasoning on solving the informal problem rather than on the formalization task. Template-guided reasoning helps mitigate this pitfall, but it also signals a broader truth about AI alignment: giving a model a clear, domain-specific scaffold can trump trying to generalize reasoning skills across domains. The research thus reinforces a pragmatic lesson for AI system design: where you want correctness and verifiability, you often want disciplined templates and explicit verification signals embedded in the training loop.

Another practical implication is the potential for cross‑pollination with code generation and software verification. Formal languages are not remote from real software practice; they underpin critical systems in finance, aerospace, safety‑critical software, and cryptography. A credible autoformalizer can accelerate the generation of correct formal specifications and even assist in generating verifiable code, if paired with code synthesis tools and robust proof assistants. The research hints at a future where natural language software specifications can be more reliably translated into verifiable code in a tightly coupled, end-to-end pipeline.

How humans and machines might co‑learn math in the near future

There is a human story tucked inside ThinkingF. It treats mathematics as a culture of practice, not just a catalog of theorems. The data synthesis strategy mirrors how human learners accumulate problem-solving habits: we absorb large libraries of examples (the formal knowledge data) and we practice through problem-solving templates that reveal common patterns of reasoning (the reasoning data). The model then learns to combine these two strands, much as a student integrates technique with understanding to produce original, correct work. It is not a mysterious leap of intelligence; it is a careful apprenticeship carried out at scale, with machine memories and human oversight working in tandem.

And yet it remains a collaboration rather than a replacement. The BEq metric measures a particular kind of correctness, a form of symbolic equivalence verified by a machine, not the full spectrum of mathematical truth in the human sense. The authors are careful to acknowledge that verification in Lean is a strong standard, but not the final arbiter of mathematical truth in all contexts. Still, the progress is tangible: a system that can reliably propose formal statements that stand up to machine checking is a powerful companion for researchers who routinely push the boundaries of formal mathematics.

The study also reminds us that the hardest parts of AI math work lie not in clever ideas alone but in the disciplined craft of data creation and alignment. The ThinkingF pipeline is as much about engineering a reliable learning signal as it is about architectural cleverness. In a field where a single line of Lean code can make or break a proof, the care with which data is synthesized, filtered, and guided becomes a kind of patience embodied in silicon. It is the long game of enabling machines to think in the way humans would if we trained them to think with the discipline of proof assistants in mind.

The human story and the future outlook

The people behind this work anchored their project in collaboration across institutions known for mathematics, computer science, and AI tooling. The Institute of Computing Technology at the Chinese Academy of Sciences hosts the core research, with academic partners at the University of Chinese Academy of Sciences and the University of Science and Technology of China, plus the industry partner StepFun. The lead author, Yutong Wu, and the team push toward a future where mathematical reasoning and formal verification can be scaled up with the help of AI that does not merely imitate reasoning but learns to follow a disciplined formal workflow. The promise is not that AI will replace human mathematicians, but that it becomes a capable partner that can draft, check, and refine formal arguments at a speed that complements human judgment and creativity.

As with any frontier tech, there are important guardrails to consider. The computational cost of rigorous BEq verification is nontrivial, and the field must balance the desire for faster iteration with the need for soundness. There is also the risk that models learn to game the verification process rather than truly understand the formal objects they manipulate. The authors address this with template-guided reasoning and a verifiable reward signal, but the broader community will want to see how these ideas generalize to more complex formal languages and how resilient they are to adversarial formulations of problems. The research lays a strong foundation, but the next steps will test the durability of the thinking templates and the scalability of the data pipelines as math problems grow in complexity and diversity.

In the end, ThinkingF feels like a bridge being built between two civilizations: the warm, human world of informal problem solving and the exact, cautious realm of formal proofs. If the bridge holds, it could reshape how we teach, how we verify, and how we explore new mathematics in an era where AI can participate without pretending to be human. The study is a clear reminder that progress in AI is often not about a single leap of cleverness but about building robust scaffolds—data, templates, and verification signals—that let systems learn to reason with the kind of rigor that math demands.