Can a robot teach itself through tree-guided planning?

Disembodied AI has made dazzling leaps, but real-world robots still stumble on long, multi-step tasks. A robot might need to locate a tomato, heat it in a microwave, and then place it in a sink basin. Each subgoal depends on the previous one, and an early misstep can derail the entire plan. Horizons that long are where embodied intelligence either earns its keep or falls apart.

SEEA-R1, created by researchers at the Beijing Innovation Center of Humanoid Robotics and the State Key Laboratory of Multimedia Information Processing at Peking University, is a bold attempt to endow embodied agents with the ability to grow smarter by teaching themselves. The team, led by Wanxin Tian and Shijie Zhang, with Jason Ju guiding the project and Shanghang Zhang and Jian Tang as corresponding authors, built a loop that makes the agent improve its own reasoning and actions over time.

Two challenges loom large in reinforcement fine-tuning for embodied tasks: rewards that are sparse and delayed, and hand-crafted rewards that fail to generalize. SEEA-R1 tackles both by turning long chains of decisions into dense, usable signals through a tree-guided approach, and by learning a reward model that can estimate progress across different tasks and environments. The result is a self-evolving agent that can plan, learn, and adapt with less human handholding.

A framework for self-evolving embodied agents

At its heart, SEEA-R1 alternates between two phases. In Data Evolution the agent interacts with its world via Monte Carlo Tree Search to build trajectories that reveal not only which actions work, but how promising each step is. The tree turns the end result into a cascade of intermediate rewards, giving learning signals at every bend in the plan rather than only at the finish line.
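
To make that concrete, here is a minimal sketch of an MCTS-style value backup, in which a single terminal outcome is propagated to every node on the path so that each intermediate step acquires a value estimate. The node structure and field names are illustrative assumptions, not SEEA-R1's actual implementation.

```python
# Minimal sketch of an MCTS-style value backup: a terminal outcome is
# propagated up the search tree so that every intermediate step receives
# a value estimate. Node fields here are illustrative, not SEEA-R1's
# actual data structures.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    action: Optional[str] = None          # action that led to this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def value(self) -> float:
        # The mean value doubles as a dense, per-step learning signal.
        return self.value_sum / self.visits if self.visits else 0.0


def backup(leaf: Node, outcome: float) -> None:
    """Propagate a terminal outcome (e.g. task success = 1.0) back to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += outcome
        node = node.parent
```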

In Model Evolution the agent updates two models: the Policy Model that actually chooses actions, and the Reward Model that judges how well a sequence did. The Reward Model is a Multi-modal Generative Reward Model built on a multimodal language model; given the agent’s history it outputs a structured verdict and, crucially, a reasoned trace. This makes the rewards less brittle and more transferable across tasks and settings.

These two engines feed each other in a closed loop: better data yields smarter policies, and smarter policies generate better data. The SEEA-R1 team demonstrates that self-evolution is not a magical leap but a disciplined cycle of exploration, evaluation, and refinement.
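
Read as code, that loop looks roughly like the sketch below. The three callables stand in for the two phases; they are hypothetical placeholders under assumed names, not SEEA-R1's actual API.

```python
# A high-level sketch of the alternating loop described above. The callables
# represent Data Evolution (collect) and Model Evolution (update_policy,
# update_reward_model); they are hypothetical placeholders, not SEEA-R1's API.
from typing import Any, Callable, Tuple

Trajectories = Any  # placeholder for a batch of search trajectories


def self_evolve(
    policy: Any,
    reward_model: Any,
    collect: Callable[[Any, Any], Trajectories],
    update_policy: Callable[[Any, Trajectories], Any],
    update_reward_model: Callable[[Any, Trajectories], Any],
    n_cycles: int = 10,
) -> Tuple[Any, Any]:
    """Alternate Data Evolution and Model Evolution for n_cycles."""
    for _ in range(n_cycles):
        # Data Evolution: search with the current policy, scoring steps
        # with the current reward model.
        trajectories = collect(policy, reward_model)
        # Model Evolution: refine both models on the freshly generated data.
        policy = update_policy(policy, trajectories)
        reward_model = update_reward_model(reward_model, trajectories)
    return policy, reward_model
```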

Turning sparse rewards into learning fuel

Tree-GRPO is the core engine that densifies the learning signal. It folds Monte Carlo Tree Search into a policy optimization routine so that the agent can reason multiple steps ahead while still receiving gradient-based updates. Instead of learning only from a final win or loss, the agent sees rewards attached to branches of the decision tree, so it learns which micro-decisions push a plan toward success.

This approach addresses the credit assignment problem that haunts long-horizon tasks. When the agent is heating a tomato, many actions come before the outcome is known; Tree-GRPO assigns credit to each node along the path, so a misstep in the middle is not hidden by the final outcome. The policy update strikes a careful balance: the advantage signal is combined with a KL penalty to keep updates stable as the agent explores.
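
As a rough illustration of that balance, here is a generic GRPO-style objective with group-relative advantages and a KL penalty toward a frozen reference policy. It is a simplified sketch under assumed hyperparameters (clip_eps, beta), not the paper's exact Tree-GRPO loss.

```python
# A generic GRPO-style objective with group-relative advantages and a
# KL penalty toward a frozen reference policy. This is a simplified
# illustration of the "advantage plus KL" balance, not the paper's
# exact Tree-GRPO loss; clip_eps and beta are assumed hyperparameters.
import numpy as np


def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize rewards within a group of sampled rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)


def grpo_objective(
    logp_new: np.ndarray,   # log-probs of chosen actions under the updated policy
    logp_old: np.ndarray,   # log-probs under the policy that generated the data
    logp_ref: np.ndarray,   # log-probs under a frozen reference policy
    advantages: np.ndarray,
    clip_eps: float = 0.2,
    beta: float = 0.04,
) -> float:
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = np.minimum(ratio * advantages, clipped * advantages)
    # A simple per-sample KL estimate keeps the update close to the reference.
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return float(np.mean(policy_term - beta * kl))
```

Maximizing an objective of this shape nudges the policy toward higher-advantage branches of the tree while the KL term keeps each update conservative.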

On ALFWorld, SEEA-R1 beat prior approaches by wide margins. The experiments used 30 search iterations per tree, with up to five candidate actions explored at each leaf, alternating data collection with model refinement. The result was a planner that could chain subgoals across rooms, appliances, and time, improving its decisions over long sequences. Compared with methods that combine MCTS with other learning signals, Tree-GRPO consistently achieved higher success rates and finished tasks in fewer steps.
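
For readers who want the reported search setup in one place, a hypothetical configuration might look like the following; the field names are illustrative, not the paper's schema.

```python
# Hypothetical settings mirroring the reported search setup; the field
# names are illustrative, not the paper's configuration schema.
search_config = {
    "mcts_iterations_per_tree": 30,    # search iterations per decision tree
    "candidate_actions_per_leaf": 5,   # actions expanded at each leaf
    "cycle": ["collect_data", "refine_models"],  # alternating phases
}
```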

A reward model that travels with you

MGRM reframes rewards as something the agent learns to predict. The Multi-modal Generative Reward Model ingests the agent’s entire history across perception and action and outputs one of three categories: success, continue, or failure. It also provides a brief, human-readable rationale for its verdict, delivering a form of interpretability that is rare in RL-based embodied systems.
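
To show what consuming such a verdict could look like, here is a minimal parsing sketch. The tag format and function name are assumptions made for illustration, not MGRM's actual output schema.

```python
# A minimal sketch of parsing a generative reward model's text output into
# a three-way verdict plus rationale. The <verdict> tag format and function
# name are assumptions for illustration, not MGRM's actual output schema.
import re
from typing import Tuple

VALID_VERDICTS = {"success", "continue", "failure"}


def parse_reward_output(text: str) -> Tuple[str, str]:
    """Return (verdict, rationale) from the reward model's generated text."""
    match = re.search(r"<verdict>\s*(\w+)\s*</verdict>", text, re.IGNORECASE)
    verdict = match.group(1).lower() if match else "continue"
    if verdict not in VALID_VERDICTS:
        verdict = "continue"  # fall back to the neutral label
    rationale = re.sub(
        r"<verdict>.*?</verdict>", "", text, flags=re.IGNORECASE | re.DOTALL
    ).strip()
    return verdict, rationale


# Example with a hypothetical model response:
print(parse_reward_output(
    "The tomato is now in the sink basin as instructed. <verdict>success</verdict>"
))
```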

Crucially, MGRM is trained from the agent’s own evolving data, not from hand-tuned simulators. This means the reward signal can adapt as the environment changes, enabling the agent to self-evolve without constant human scripting. In the ALFWorld unseen tasks, SEEA-R1 with MGRM-based rewards achieved strong textual performance and competitive multimodal performance, even when ground-truth rewards were removed from the loop. These results point to a future where reward estimation itself becomes part of the learning loop.

Of course there are limits. The paper documents challenges with using world-model feedback as environmental signals, noting that current world models can hallucinate or misinterpret tasks, slowing data collection and increasing risk. The authors envision scaling up to bigger models and more diverse environments, but acknowledge that bridging from simulated kitchens to real homes remains a hard barrier. Still, the core idea stands: let the agent learn to learn, and let its own learning signal sharpen with experience.

SEEA-R1 does not claim to have solved embodied AI, but it sketches a viable path toward agents that grow smarter by walking their own reasoning journeys. In doing so, it reframes training: instead of a one-way transfer of human wisdom into a planner, it becomes a feedback loop in which the planner and the reward predictor co-evolve. The work comes from the Beijing Innovation Center of Humanoid Robotics and Peking University, with Wanxin Tian and Shijie Zhang as co-first authors, Jason Ju as project leader, and Shanghang Zhang and Jian Tang as corresponding authors. If this line of research continues to mature, we might see robots that not only perform tasks but continually hone the very habits that let them perform better tomorrow.