The world of large language models (LLMs) has been moving so quickly that it can feel like riding a wave while blindfolded—you sense the motion, but you’re not always sure what the water is made of. What if the truest test of their intelligence isn’t how smoothly they can spit out facts or mimic human chatter, but whether they can understand and manipulate the tangled web of relationships that underpins real knowledge?
That question sits at the heart of a new study from researchers spanning The Hong Kong Polytechnic University, Beijing Institute of Technology, and The University of Hong Kong. The team, led by Chi Chiu So, tests whether three cutting-edge LLMs can perform deep relational reasoning—the kind of multi-step, structured thinking that humans use to deduce family ties, navigate graphs, or reason about how a chain of connections leads from one idea to another. In other words: can a machine think through relationships the way you think through a family album or a map of a city’s transit lines?
What the authors call a “deep reasoning” capability is not just about solving a clever puzzle. It’s a litmus test for the core competence many researchers see as essential for artificial general intelligence: the ability to reason about relationships across many steps, across multiple entities, and across different kinds of structures. The paper doesn’t claim to solve AGI, but it offers a rare, careful glimpse into what current models can do when the task requires more than pattern-matching or arithmetic. And it does so with a disciplined, benchmark-driven approach that invites both awe and caution.
What the study actually tests
To probe relational reasoning, the researchers designed two kinds of benchmarks that feel almost architectural: family tree reasoning and general graph reasoning. In the family tree test, the task is to deduce high-level relationships like HasSister, IsGrandson, IsAunt, and IsPaternalGreatAunt from a web of basic kinship facts (mother, father, son, daughter). In the graph test, the challenge is to infer connectivity and shortest-path properties in a network of nodes connected by directed edges. The setup is deliberately simple in its language—after all, the point is the logic, not the poetry of a problem statement—but the reasoning required is anything but trivial. The researchers emphasize that these two benchmarks are designed to stress multi-step, structured inference in zero-shot prompting (no few-shot examples provided).
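To make the family-tree task concrete, here is a minimal sketch of how ground-truth relations of this kind can be derived from basic kinship facts. The people, the fact encoding, and the helper functions are illustrative assumptions, not the authors' benchmark code.

```python
# Minimal sketch: deriving higher-order kinship relations from basic facts.
# Names, fact encoding, and helpers are illustrative, not from the paper.
from itertools import product

people = ["Ann", "Bob", "Cara", "Dan"]
male = {"Bob", "Dan"}
parent = {("Ann", "Bob"), ("Bob", "Cara"), ("Bob", "Dan")}  # (parent, child) pairs

def is_grandson(x, y):
    """IsGrandson(x, y): x is a male grandchild of y."""
    return x in male and any(
        (y, z) in parent and (z, x) in parent for z in people
    )

def has_sister(x, y):
    """HasSister(x, y): y is a female sibling of x (shares a parent with x)."""
    return (
        x != y
        and y not in male
        and any((p, x) in parent and (p, y) in parent for p in people)
    )

print([(x, y) for x, y in product(people, repeat=2) if is_grandson(x, y)])
# [('Dan', 'Ann')]
print([(x, y) for x, y in product(people, repeat=2) if has_sister(x, y)])
# [('Dan', 'Cara')]
```

Even this toy version hints at why the task stresses multi-step inference: each target relation is a chain of two or more basic facts, and the chains only get longer for relations like IsPaternalGreatAunt.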
To turn the logical puzzles into something an LLM can tackle, the team translates the relational data into a matrix of Boolean facts. From the raw relations, they derive higher-order predicates, and then they prompt the LLMs to reason in natural language while returning a JSON-like representation of a matrix that encodes the target relation. In effect, the model is asked to reveal its reasoning and then produce a structured verdict. The system uses a carefully crafted prompt to encourage the model to lay out reasoning steps and to format the output as a matrix. The process is not about programming a solver into the model; it’s about inviting the model to think through a chain of inferences and then demonstrate the outcome in a machine-readable form.
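The paper's exact prompt wording and output schema are not reproduced here; the sketch below only illustrates the general idea of a Boolean relation matrix returned in a JSON-like form, with entity names and prompt text as assumptions.

```python
# Illustrative encoding of a target relation as an n x n Boolean matrix,
# serialized as JSON for comparison against the model's structured answer.
# The schema and prompt text are assumptions for illustration only.
import json

entities = ["Ann", "Bob", "Cara", "Dan"]
index = {name: i for i, name in enumerate(entities)}

def relation_matrix(pairs, n):
    """Turn a set of (x, y) pairs into an n x n 0/1 matrix."""
    m = [[0] * n for _ in range(n)]
    for x, y in pairs:
        m[index[x]][index[y]] = 1
    return m

gold = relation_matrix({("Dan", "Ann")}, len(entities))
print(json.dumps({"relation": "IsGrandson", "matrix": gold}))

prompt = (
    "Lay out your reasoning steps in natural language, then output the "
    "IsGrandson relation as a JSON object with a 'matrix' key."
)  # a paraphrase of the prompting strategy, not the paper's literal prompt
```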
The study pits three models against each other: DeepSeek-R1, a 671-billion-parameter behemoth touted for its deep reasoning capabilities; a smaller DeepSeek-V3; and GPT-4o, a leading GPT-series model. DeepSeek-R1 is particularly notable because its training and architecture emphasize long, structured chains of thought (CoT) and a strategy of planning, verification, and stepwise deduction. In other words, it is built to be a reasoning machine with a notebook, not just a parrot of statistical patterns. The other two serve as strong baselines: capable, but not designed around long CoT or explicit deep reasoning mechanics.
What DeepSeek-R1 does right and where it stumbles
Across both benchmarks and across problem sizes, the results settle into a clear pattern. DeepSeek-R1 consistently outperforms the other models on tasks that demand deep relational reasoning, especially when the problems are modest in size. In the family-tree tests, it reaches top scores on several relations that require chaining multiple steps of deduction—the kind of work that looks easy on the surface but is, in fact, brittle and error-prone for many models when the reasoning gets long. In the HasSister task, for example, DeepSeek-R1 runs away with high accuracy early on, while its contemporaries lag behind.
The most striking win is IsGrandson(x, y) in the family-tree domain. At smaller scales (n = 10 and n = 20), DeepSeek-R1 posts strong F1-scores, often above 0.9, demonstrating a robust ability to connect the dots across generations. The paper ties that performance to a long Chain-of-Thought (CoT): a deliberate, multi-step reasoning trace that explains how the model builds its conclusions. It’s tempting to see this as the model “thinking out loud” in a way that resembles human reasoning, and the researchers illustrate the point with vivid excerpts showing planning, mapping of parent-child relationships, and verification steps.
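For readers who want to pin down what an F1-score means in this matrix setting, here is one standard way to compute it, treating every cell of the relation matrix as a binary prediction. The paper's exact scoring protocol may differ in detail.

```python
# F1 over all cells of two equally sized 0/1 matrices: standard definition,
# not necessarily the paper's exact evaluation protocol.
def f1_score(gold, pred):
    tp = fp = fn = 0
    for g_row, p_row in zip(gold, pred):
        for g, p in zip(g_row, p_row):
            tp += int(g == 1 and p == 1)
            fp += int(g == 0 and p == 1)
            fn += int(g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)

gold = [[0, 1], [0, 0]]
pred = [[0, 1], [1, 0]]  # one true positive, one false positive
print(round(f1_score(gold, pred), 3))  # 0.667
```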
But the paper also treats the results with necessary caution. When the problem size grows (n = 40), all models begin to falter, and in several targets—IsGrandson, IsAunt, IsPaternalGreatAunt—the F1-scores collapse to zero. The authors pin this to a practical bottleneck: a token-length limit. Even the strongest deep-reasoning model can’t fit the longer, more elaborate reasoning traces and the larger output matrices into its context window. In short, the architecture’s core constraint—how much text it can chew and hold at once—becomes the bottleneck that curtails performance at scale.
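A rough back-of-the-envelope calculation shows why the budget bites as n grows: the answer matrix alone has n² cells before a single word of reasoning is written. The tokens-per-cell figure below is an assumption for illustration, not a number from the paper.

```python
# Why output length scales badly: the answer alone is an n x n matrix,
# so the number of cells grows quadratically with problem size.
# The ~2 tokens-per-cell estimate is an illustrative assumption.
for n in (10, 20, 40):
    cells = n * n
    approx_tokens = cells * 2  # digits plus separators, roughly
    print(f"n={n:>2}: {cells:>4} cells, ~{approx_tokens} output tokens "
          "before any chain-of-thought text")
```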
What does this say about the “depth” of the reasoning? The paper’s qualitative look at DeepSeek-R1’s internal traces reveals a genuinely organized approach: a planning phase, a staged build-out of relationships, and a verification loop that checks intermediate results before producing the final matrix. Yet the traces themselves also show potential cracks. The researchers note that some intermediate steps appear unstructured or only partially coherent, hinting that the underlying rationale may not always be fully sound. It’s a reminder that long CoT can resemble human reasoning in form, but it does not guarantee sound logical justification behind every step.
On the graph side, the results echo the same trend. At n = 10 and n = 20, DeepSeek-R1 again leads the field on the Connectivity and Shortest targets, while the other models show more variability. By n = 40, the same bottleneck reappears: the token limit erodes even DeepSeek-R1’s advantage, and performance across all models degrades. The takeaway is not that these models cannot reason about relationships; it’s that scaling complexity while maintaining a coherent internal narrative is a constraint that current architectures still struggle to surmount.
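For the graph benchmark, the ground truth for both targets can be read off a breadth-first search. The sketch below assumes an adjacency-list encoding and hypothetical node labels; the paper's graph generator may differ.

```python
# Sketch: computing Connectivity and Shortest-path ground truth on a
# directed graph with breadth-first search. Graph and labels are illustrative.
from collections import deque

def bfs_distances(adj, source):
    """Return {node: shortest hop count from source} over directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
dist = bfs_distances(adj, 0)
print(3 in dist)  # Connectivity(0, 3): True
print(dist[3])    # Shortest(0, 3): 2 hops
```

The point of spelling this out is that the ground truth is cheap to compute symbolically; what the benchmark measures is whether an LLM can reproduce it through natural-language reasoning alone.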
There’s a broader, almost philosophical takeaway in the data. DeepSeek-R1 demonstrates that with the right architectural emphasis—long CoT, explicit planning, and reinforcement learning-guided reasoning—the model can perform structured, multi-step inferences that look impressively human. But the same study makes it clear that the current generation’s strength is tightly bounded by practical limitations. A digital mind can trace a path through a labyrinth of relationships, but when the labyrinth grows too large, its map must be trimmed to fit. That trimming is not just a matter of computing power; it’s a constraint baked into how the models process sequences of tokens and produce output.
Why this matters beyond the lab
Three threads braid together when you step back and think about the implications. First, there’s a practical one: if you want AI systems that can reason about complex structures—genealogies, legal trees, supply chains, or knowledge graphs—the ability to hold and manipulate long, multi-step reasoning traces matters. The study shows that DeepSeek-R1 plausibly approaches a level where it can perform such tasks reliably at modest scales, and it highlights what currently prevents it from scaling—namely, token budgets and output completeness. In real-world terms, a system that can plan steps, verify them, and still finish its reasoning within a user-specified limit is closer to usable reasoning software than a model that shrugs off complexity with a single, confident token flood.
Second, there’s a cautionary tale about what “deep reasoning” means in practice. The authors do not claim an unambiguous, fully transparent chain of thought that you can audit and trust without caveats. The long CoT traces can illuminate the model’s approach, but they can also harbor missteps that are hard to detect without meticulous, human-level scrutiny. If we want AI to be a dependable partner in domains like law, medicine, or engineering, we’ll need more than shiny reasoning traces; we’ll need robust methods to validate those traces and to detect when the reasoning path deviates from sound logic. That’s not just a technical challenge; it’s a social and governance one—how to balance transparency with the risk of exposing the model’s internal heuristics to misuse or misunderstanding.
Third, the study gestures toward a future where multimodal reasoning could be a game changer. The authors point out that visual or spatial representations of relational data—think diagrams or family trees drawn out as graphs—might help LLMs reason more robustly. It’s a hopeful nudge toward a future where we don’t rely solely on textual prompts to coax deep reasoning from machines. If a model can see a diagram and reason about it in concert with text, we may unlock more resilient and scalable forms of inference. The paper’s authors even call for more systematic exploration of reasoning failures, which is code for a research program: build better models, test them in more varied, messier real-world tasks, and learn from the wrong answers as much as the right ones.
What does this mean for you and me? For the curious reader who wants to know how far AI has come, the takeaway is both hopeful and sobering. Hopeful because we’re seeing concrete, measurable progress in a realm that used to be the exclusive province of human cognition: structured, multi-step reasoning about relations. Sobering because the progress isn’t a straight line toward infinity. The real world will keep throwing longer chains of reasoning at models, and token budgets, output completeness, and the need to validate internal reasoning will keep being the gatekeepers. If you’re excited by the idea of AI that can map out a family tree, navigate a complex network, or parse a dense set of relationships in a legal brief, this paper is a clear signal that we’re moving in that direction—and also a clear reminder of where the road ends for now.
Finally, the authors’ call to action is worth noting. The work is more than a novelty demonstration; it’s a structured, empirical stake in the ground. The researchers argue for more work on multimodal reasoning, for nuanced examination of long Chain-of-Thought processes, and for careful study of where and why reasoning fails. They also commit to open science by promising to release code on GitHub. That transparency matters. It invites replication, critique, and improvement, which is how science advances when the subject is as philosophically charged as machine thinking.
The study’s authors present their work as a stepping stone toward more robust, interpretable reasoning in AI. If you read their paper as a map, it points toward potential cultural and technical breakthroughs: systems that can, in the right contexts, understand complex relationships as deftly as a human genealogist or an investigative journalist. It also marks the edges of what we can reasonably expect from today’s architectures without a concerted push toward new paradigms. Either way, the trajectory is clear: we’re moving from surface-level language mimicry toward genuine relational intelligence, with all the promise and all the caveats that come with such a shift.
In the end, the study is a reminder that intelligence—digital or organic—is not a single trick but a suite of capabilities working together: memory, planning, verification, and the disciplined handling of complexity. DeepSeek-R1 shows what it looks like when those pieces click in concert, at least for mid-sized problems. The rest of us get a front-row seat to an ongoing experiment about how far we can push machines to understand the world through the tangled relationships that knit everything together.
Notes on the study: The work is a collaboration among The Hong Kong Polytechnic University, the Beijing Institute of Technology, and The University of Hong Kong. The lead author is Chi Chiu So, with co-authors Yueyue Sun, Jun-Min Wang, Siu Pang Yung, Anthony Wai Keung Loh, and Chun Pong Chau. The research highlights the DeepSeek-R1 model and compares it to DeepSeek-V3 and GPT-4o, using two zero-shot benchmarks: family tree reasoning and general graph reasoning. The paper emphasizes token-length limitations as a primary bottleneck for scaling deep relational reasoning and calls for future work in multimodal reasoning and deeper analysis of long Chain-of-Thought processes. The authors also note that the code will be made publicly available on GitHub.