Artificial intelligence that can reason about actions and change is a cornerstone of real-world autonomy. From robots gripping objects to delivery fleets plotting routes, the ability to plan is what keeps AI from merely reacting to the present and instead shaping a reliable path toward the future. A team at IBM Research—Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi—has built a new benchmark called ACPBench Hard to probe exactly how well large language models (LLMs) can reason about action, change, and planning when pushed beyond simple, multiple-choice quizzes. Their aim isn’t to crown a winner in a vacuum; it’s to illuminate where today’s AI planners stumble and, crucially, why those gaps matter for the safety, reliability, and usefulness of autonomous systems.
ACPBench Hard is an ambitious extension of a prior benchmark that separated planning into atomic reasoning tasks. The old version tested whether a model could decide if an action was applicable, or what would happen when an action was executed, among other discrete checks. ACPBench Hard flips the script: it asks open-ended, generative questions. In other words, instead of choosing from a list, the model must generate the precise, action-by-action answers a symbolic planner would need in order to proceed. The researchers also add a fresh task, asking the model to propose an action that would move the state closer to a goal. All of this unfolds across 13 planning domains specified in the Planning Domain Definition Language (PDDL), which keeps the problems realistic and diverse rather than toy-like.
Why does this matter? Planning isn’t just an academic exercise; it underpins how AI navigates complexity in the real world. If an autonomous agent can’t reliably figure out which actions are possible, what happens next, or which subgoals must come first, it’s as if a navigator keeps steering into cul-de-sacs even when a clear, efficient route exists. ACPBench Hard targets the exact cognitive muscle that planners rely on: precise reasoning about action, change, and the consequences of sequences of steps. The results, as the authors show, reveal a landscape where even the reigning AI giants struggle to reason as a planner would—and where the gap between “language model” and “planning model” becomes not a philosophical riddle but a practical bottleneck for real systems.
What ACPBench Hard Tests and Why It’s Different
At the heart of ACPBench Hard is a deliberately lean form of reasoning about planning tasks. The dataset keeps the math and the logic abstract, steering away from image inputs or sensor noise. The eight tasks cover a spectrum of planning challenges: Applicability (which actions can be taken in a given state?), Progression (what changes after an action is applied?), Reachability (can a fact eventually become true?), Action Reachability (can an action become applicable in some reachable state?), Validation (is a given plan valid?), Justification (can a plan be simplified by removing actions?), Landmarks (which facts must occur on any plan?), and Next Action (what is the next move that nudges us toward the goal?). The new Next Action task is especially telling because it mirrors how planners work in practice: marching through a sequence, one step at a time, toward an objective.
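To make that taxonomy concrete, here is a minimal sketch of the eight task categories as plain data; the enum and its member names are illustrative shorthand for this article, not identifiers taken from the benchmark itself.

```python
# Illustrative enumeration of the eight ACPBench Hard task categories described
# above; the names are this article's shorthand, not the benchmark's own labels.
from enum import Enum


class ACPTask(Enum):
    APPLICABILITY = "which actions can be taken in a given state?"
    PROGRESSION = "what changes after an action is applied?"
    REACHABILITY = "can a fact eventually become true?"
    ACTION_REACHABILITY = "can an action become applicable in some reachable state?"
    VALIDATION = "is a given plan valid?"
    JUSTIFICATION = "can a plan be simplified by removing actions?"
    LANDMARKS = "which facts must occur on any plan?"
    NEXT_ACTION = "what is the next move that nudges us toward the goal?"
```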
Rather than relying on tidy, fill-in-the-blank prompts, the ACPBench Hard dataset uses open-ended questions that demand a precise, generative answer. That means models must ground their responses in the same symbolic structures a planner would use—propositions, preconditions, add and delete effects, and the state transitions that tie them together. The researchers also designed validators—algorithmic checks that decide whether a given answer is correct—so that evaluation isn’t left to guesswork. In essence, the benchmark tries to separate the planning process into components, then tests whether current AI systems can reliably produce the components themselves. This isn’t just about producing one right answer; it’s about producing the right kind of answer consistently enough to feed a planner or policy module.
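To give a flavor of what those symbolic structures and validator checks involve, here is a minimal sketch in Python, assuming a STRIPS-style state represented as a set of ground facts; the Action class and function names are illustrative, not the paper’s actual code.

```python
# A toy STRIPS-style action model: states are frozensets of ground facts, and
# an action is defined by its preconditions, add effects, and delete effects.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold for the action to apply
    add_effects: frozenset    # facts the action makes true
    del_effects: frozenset    # facts the action makes false


def is_applicable(action: Action, state: frozenset) -> bool:
    """An action is applicable when every precondition holds in the state."""
    return action.preconditions <= state


def progress(state: frozenset, action: Action) -> frozenset:
    """Progression: drop the delete effects, then add the add effects."""
    return (state - action.del_effects) | action.add_effects
```

A validator for the Applicability task, for instance, only needs to parse the actions a model lists and run the is_applicable check on each; a Progression answer can be compared directly against the state returned by progress.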
The test suite spans 13 planning domains modeled on realistic problems rather than contrived toy worlds. Each domain comes with a formal STRIPS-style description of facts, actions, and how those actions transform states. The result is a challenging cross-section: some domains are friendly, others treacherous, and some yield surprising patterns in model behavior. By embedding the tasks into domains with varied structure, the authors aim to prevent models from gaming a single layout and to reveal where general planning competence truly lies—or doesn’t lie—in current AI systems.
What the Experiments Uncovered About Our AI Partners
The paper presents a sobering portrait of progress. Across a wide spectrum of models, from smaller 8-billion-parameter systems to giants with hundreds of billions of parameters and beyond, no single model dominates every task. In fact, for the hardest tasks—atom reachability and action reachability—accuracy stays stubbornly low across most models. These are not marginal shortfalls: most models sit below the 65% mark on many tasks, and on several tasks accuracy hovers in the 20s or 30s. This isn’t just a little gap; it’s a chasm between what a language model can do in freeform generation and what planning systems require to operate reliably in the wild.
On the surface, the results look oddly uneven. The “progression” task—the one that asks what changes when you execute an action—emerges as one of the easier challenges. Some large models reach into the 40s or even low 50s, while a few planning-oriented or “reasoning” models push higher, but none deliver universal mastery. Setting aside issues of grammar and output formatting, the crucial takeaway is that even the most capable models struggle to generate the precise, ground-truth components that a planner needs to chain actions correctly. The researchers quantify this by showing that even the top performers on certain tasks fail to generalize across domains; a model might excel in a few settings but stumble in others, indicating brittle, domain-specific understanding rather than robust planning intelligence.
One striking detail is the performance gap between open-ended generative answers and more constrained formats like boolean or multiple-choice questions. GPT-4o—one of the biggest names in the space—drops noticeably when asked to generate an answer rather than select one from a menu. With a few exceptions, the generative format raises error rates by a substantial margin compared with the boolean and multiple-choice formats. This speaks to a deeper truth: producing human-readable, free-form reasoning that adheres to precise technical constraints is harder for models than clicking an option that looks correct. The validators help, but they can only do so much when the raw output drifts from the required structure or when the answer itself is subtly misaligned with the formal task.
Domain-by-domain, the landscape is uneven. Some settings—like the Depot or grid-based environments—see relatively stronger performance from certain models, while others—ALFWorld or Floortile—remain stubbornly hard across the board. The domain mix matters because it mirrors the heterogeneity AI must navigate in the real world: every industry has its own structure, constraints, and surprising corner cases. The results remind us that a one-size-fits-all AI planner is unlikely to arrive from a single, off-the-shelf model. Instead, progress will require a toolkit: models tuned for planning, robust validators, and hybrids that couple statistical reasoning with symbolic verification.
The researchers also offer a pragmatic lens on what to do next. If you want AI planners to be practical, you may not need the biggest model possible; you might instead pursue targeted training data for generative planning tasks, or craft prompts that elicit the exact kinds of structured outputs planners need, or build smaller, specialized models fine-tuned on planning data. The paper even hints at chain-of-thought-like approaches to generate intermediate reasoning steps that validators can check, a path that could bring a middle ground between free-form reasoning and strict formal verification.
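As one way to picture that middle ground, here is a minimal sketch of a step-by-step check over a candidate plan or chain of proposed actions, reusing the illustrative Action helpers from the earlier sketch; it is an assumption about how such a check could work, not the paper’s implementation.

```python
# Replay a candidate plan from the initial state, rejecting it at the first
# inapplicable action; accept only if the goal holds after the final step.
# Relies on the Action, is_applicable, and progress definitions sketched above.
def validate_plan(state: frozenset, plan: list, goal: frozenset) -> bool:
    for action in plan:
        if not is_applicable(action, state):
            return False                      # the chain of steps breaks here
        state = progress(state, action)
    return goal <= state                      # goal facts must hold at the end
```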
Why This Changes How We Build AI That Maps the Future
The ACPBench Hard study reframes planning as a partnership problem: language models can contribute, but they must operate alongside symbolic reasoning and planning tools that enforce correctness. In other words, a competent AI planner might resemble a co-pilot more than a lone navigator—one who proposes candidate actions, then lets a separate verifier prune, validate, and sequence them into an executable plan. The validators described in the paper are not afterthoughts; they’re a central piece of the architecture, capable of scoring and filtering open-ended answers in a way that aligns with how planners would inspect a candidate solution.
What does this mean for the real world? Robotics, logistics, and autonomous systems depend on reliable planning under uncertainty and constraint. If AI systems can’t consistently identify which actions are available, or predict the consequences of a sequence, their deployments risk mistakes that are costly or dangerous. The ACPBench Hard findings imply that we should not count on raw LLMs to supplant symbolic planners any time soon. Instead, the path forward points toward hybrid systems: LLMs generate possibilities, while planners and verifiers enforce correctness, and a learning loop trains models to better anticipate planning-relevant constraints.
In practice, this could look like modular AI stacks where a language-model module supplies contextual understanding and natural-language interfacing, a planner module handles action sequencing with strict guarantees, and a validator module checks every step for correctness against the problem definition. The research also hints at more specialized directions: training data that captures the exact kind of open-ended questions planners face, and methods to render chain-of-thought reasoning in a way that a verifier can audit efficiently. The big lesson is not that AI planning is impossible, but that it is intricate enough to demand a careful blend of tools, rather than a single, monolithic block of intelligence.
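A minimal sketch of that propose-then-verify loop, again reusing the illustrative Action model from earlier, might look like the following; the propose_actions callable stands in for whatever language-model interface supplies ranked suggestions and is an assumption, not a real API.

```python
# Hybrid loop: a language model proposes candidate action names, and a symbolic
# check filters out anything that is not actually applicable in the state.
# Relies on the Action and is_applicable definitions sketched above.
from typing import Callable, Iterable


def next_verified_action(
    state: frozenset,
    goal: frozenset,
    candidate_actions: Iterable[Action],
    propose_actions: Callable[[frozenset, frozenset], "list[str]"],
):
    by_name = {a.name: a for a in candidate_actions}
    for name in propose_actions(state, goal):      # LLM suggestions, best first
        action = by_name.get(name)
        if action is not None and is_applicable(action, state):
            return action                          # verified; safe to hand off
    return None                                    # every proposal was invalid
```

The design choice mirrors the paper’s framing: the statistical component is free to be creative, while correctness is enforced by a symbolic check before any action reaches execution.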
The IBM Research team’s ACPBench Hard, with its rigorous evaluation framework and diverse domains, provides a map for this journey. It highlights the tasks that currently trip up even the strongest models and clarifies where improvements would yield the greatest practical leverage. The study also invites the AI community to think differently about evaluation: progress in planning isn’t just about higher overall accuracy, but about building systems whose components can be trusted to work together, step by deliberate step, in environments that matter to people and businesses alike.
Ultimately, the work acts as an honest thermometer for planning intelligence. It doesn’t pretend that bigger models alone will solve the problem; it asks us to consider how we build, test, and connect components so that AI can plan with the reliability humans expect. The trail map it provides—from eight core tasks to a robust validation architecture and a diverse set of domains—offers a concrete blueprint for the next wave of improvement. For a field racing toward autonomous, trustworthy systems, ACPBench Hard is less a verdict and more a compass pointing to where to aim next.
As a closing note, it’s worth naming the bench itself: ACPBench Hard is the product of IBM Research, with Harsha Kokel and coauthors steering the work. Their effort crystallizes a truth: progress in AI planning will be incremental, collaborative, and deeply technical, but it’s exactly the kind of progress that makes smarter, safer automation possible—one carefully validated step at a time.