Can AI counselors truly understand human nuance?

When a therapist asks you to tell your story, you expect more than polite phrases and quick fixes. You want a partner who hears you—not just the words you say, but the hesitations you swallow, the silences you fear, and the values you’re trying to protect. A study published in 2025, led by Keita Kiuchi at the Japan National Institute of Occupational Safety and Health, brings that expectation into sharp relief for artificial intelligence. It’s not a critique of AI but a map: how far current Japanese-language counseling AIs can go, where they stumble, and what would need to change for them to become genuinely useful allies in mental health and workplace well-being.

The researchers built a microcosm of a counseling session—three roles found in any counseling exchange: counselor AI, client AI, and evaluator AI. They ran these roles in Japanese and used a standardized yardstick known as the MITI, the Motivational Interviewing Treatment Integrity coding manual, to measure four core behaviors plus a global impression. The goal wasn’t to prove AI could replace human counselors overnight. It was to establish a benchmark: in non-English settings, with real human evaluators and rigorous metrics, can AI counseling hold together the nuanced dance of listening, reflecting, guiding, and collaborating with a person who is, in fact, a person?

Behind the study stands a consortium of Japanese institutions—Kiuchi and colleagues from the National Institute of Occupational Safety and Health, Kaze to Taikyo, Saga Occupational Health Association, Zikei Hospital and its psychiatry institute, Suzuka University of Medical Science, Ritsumeikan University, and the National Defense Medical College—driven by a common curiosity: how close are AI counselors to human partners in a culturally specific, language-rich setting? The answer, for now, is nuanced. The team shows real progress when AI counselors follow structured dialogue prompts, but they also highlight stubborn gaps in emotion, tempo, and authentic collaboration that no one should overlook when shaping tools intended for real people.

A Japanese MITI Benchmark for AI Counseling

To break the problem into something researchers could measure, the team created three counselor AIs. One used GPT-4-turbo in a zero-shot mode, another used GPT-4-turbo with Structured Multi-step Dialogue Prompts (SMDP), and a third used Claude-3-Opus, also with SMDP. On the other side of the screen, client AIs played work-related personas—differentiated by age and sex—to simulate how different clients might present themselves in a session. A trio of evaluation AIs—o3, Claude-3.7-Sonnet, and Gemini-2.5-pro—then judged the dialogues with MITI-style ratings, mirroring how human experts would assess a real conversation.
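To make the shape of that setup easier to picture, here is a minimal Python sketch of the three-role loop, offered purely as an illustration. The function names, turn counts, and score fields are assumptions made for the sketch, not the authors' code, and `call_model` is a placeholder standing in for whatever chat API a real pipeline would use.

```python
from dataclasses import dataclass

def call_model(model: str, system_prompt: str, transcript: list[str]) -> str:
    # Placeholder: a real pipeline would call the vendor's chat API here.
    return f"[{model} reply after {len(transcript)} prior turns]"

@dataclass
class MITIGlobalScores:
    # MITI-style global ratings as described in the article; field names are assumptions.
    cultivating_change_talk: float
    softening_sustain_talk: float
    partnership: float
    empathy: float
    overall_impression: float  # the extra OVR rating the study reports

def run_session(counselor_model: str, client_model: str,
                counselor_prompt: str, client_persona: str,
                n_turns: int = 10) -> list[str]:
    """Alternate counselor and client turns to produce a transcript."""
    transcript: list[str] = []
    for _ in range(n_turns):
        transcript.append("Counselor: " + call_model(counselor_model, counselor_prompt, transcript))
        transcript.append("Client: " + call_model(client_model, client_persona, transcript))
    return transcript

def evaluate_session(evaluator_model: str, transcript: list[str]) -> MITIGlobalScores:
    """Ask an evaluator AI for MITI-style global ratings (parsing omitted in this toy)."""
    rubric = "Rate the counselor on each MITI global scale from 1 to 5."
    _raw = call_model(evaluator_model, rubric, transcript)
    # A real implementation would parse _raw into numbers; this sketch returns placeholders.
    return MITIGlobalScores(0.0, 0.0, 0.0, 0.0, 0.0)

if __name__ == "__main__":
    session = run_session("counselor-llm", "client-llm",
                          counselor_prompt="(SMDP-style instructions)",
                          client_persona="(work-stress persona, 40s)")
    print(evaluate_session("evaluator-llm", session))
```

The point of the sketch is the structure, not the stubs: counselor and client alternate to produce a transcript, and a separate evaluator model scores that transcript on MITI-style global scales, which is exactly the seam where the study later compares machine judges against its 15 human experts.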

The design isn’t just clever; it’s essential. MITI evaluates a counselor’s ability to cultivate change talk, soften sustain talk, establish partnership, and show empathy. It also includes an overall impression. If you’re trying to build AI that can sit across a desk from a person and help them reflect, decide, and act, MITI is a way to quantify how well the AI sustains a human-centered rhythm rather than becoming a robotic “advice machine.” The researchers had 15 human experts, all with substantial counseling experience, review three counselor-client script variants for each client profile. They then compared those human judgments to the scores from the three evaluation AIs. The result is a layered portrait of capability and bias across language, culture, and model design.

One notable finding: when the counselor AIs used SMDP prompts—the Listen-Back-1 and Listen-Back-2 steps followed by purposeful questions—their performance rose across all MITI global ratings. The improvement was robust and did not hinge on a single model. In other words, the act of guiding the AI to follow a structured, ethically careful dialogue beat simply telling it to “act as a counselor.” This is a powerful reminder that prompting strategy matters just as much as the underlying model in sensitive social tasks.
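The article describes the SMDP flow but not the prompt text itself, so the snippet below is a hedged sketch of what a structured multi-step dialogue prompt in that spirit might look like; the wording is an assumption, not the study's actual instructions.

```python
# Illustrative only: an SMDP-style instruction in the spirit the article
# describes (Listen-Back-1, Listen-Back-2, then a purposeful question).
# The study's actual prompt text is not reproduced here, so treat the
# wording below as an assumption about its general shape.
SMDP_STYLE_COUNSELOR_PROMPT = """
You are a workplace counselor using motivational interviewing, in Japanese.
Move through this cycle, one step per counselor turn:

1. Listen-Back-1: Reflect the client's last statement briefly in your own
   words, then stop and let the client respond.
2. Listen-Back-2: Reflect again, going one level deeper into the feeling or
   value behind what the client just said, then stop again.
3. Purposeful question: Ask one open question that invites the client to
   explore change, their values, or a small next step. Then return to step 1.

Do not give advice unless the client explicitly asks for it.
Keep each reply short and warm; never rush the client.
"""
```

In the three-role sketch above, a string like this would be handed to the counselor model as `counselor_prompt`, while the zero-shot baseline would receive only a bare "act as a counselor" style instruction.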

What the biases reveal about AI minds

The study uncovered three kinds of biases worth noting. First, the three evaluation AIs, while broadly in line with human ratings on some measures, tended to overestimate the counselor’s overall quality, specifically the global impression (OVR) and Softening Sustain Talk (SST) ratings. In practical terms, the AI judges tended to reward politeness and surface-level engagement even when deeper empathetic engagement or value-based exploration remained shallow. This isn’t a critique of the models so much as a reminder that different evaluators—humans and machines—apply different lenses when judging the same dialogue.

Second, the study found model-specific “personalities.” Gemini tended to emphasize power-sharing and client autonomy; Sonnet leaned toward emotional expression and supportive affirmation; o3 accentuated technical proficiency and structured dialogue. These biases weren’t just abstract quirks; they influenced the scores in systematic ways. It’s a cautionary note for teams designing or combining AI counselors: the “personality” of a model will shape what it values in a session, which in turn affects the kind of help it offers and how it’s judged by others.

Third, the client AIs showed a stubborn lack of emotional range. They offered responses that were compliant and predictable, with limited backchanneling or resistance—a mismatch with how real humans sometimes push back, explore, or test boundaries in therapy. The authors flag this as a realism problem common to AI role-plays. In effect, the AI clients behave like obliging co-authors rather than living, breathing humans who oscillate between doubt, fear, anger, and moments of clarity. The mismatch matters: if the client side of the dialogue isn’t realistic, the counselor AI is rehearsing against a simplified score rather than the complex music of a real session.

The paper doesn’t pretend these biases disappear with a few tweaks. It’s honest about the gaps and clear about where to focus next: retrieval-augmented generation (RAG) to help counselors pull in richer background content, more nuanced persona settings for clients, and fine-tuning that aligns AI evaluation with the deeper, higher-order aspects of empathy and collaboration that MITI seeks to measure.

Real-world implications and a cautious optimism

The implications stretch beyond a single lab or a single language. The study’s Japanese framing matters because mental health tools travel badly if you ignore culture and language. Counseling is not a generic “talking cure”; it’s a practice that relies on trust, warmth, nuanced listening, and the ability to co-create a path forward with the client. The researchers are careful to frame AI as an amplifier for human counselors, not a replacement. The findings suggest a pragmatic path: we can build AI counselors that are genuinely helpful in some contexts, but to reach broader applicability, we need to improve realism in client simulations, expand the emotional bandwidth of the AI, and align evaluation criteria with the subtle, value-laden goals of therapy.

From a training standpoint, the study offers a blueprint for AI-assisted counseling education. SMDP prompts, which encode reflective listening and a disciplined flow from problem to solution, appear to be a crucial step toward better practice. For training programs, that means designing curricula around structured dialogue schemas, paired with feedback systems that not only count reflections and questions but also evaluate the quality of empathy and collaboration. For developers, the lesson is equally direct: prompts aren’t cosmetic; they’re architectural. If you want an AI to participate in sensitive, long-form conversations, you must give it a disciplined conversational spine.

The cross-linguistic and cross-cultural dimension is equally revealing. The authors argue for culturally sensitive AI mental health tools that aren’t mere translations of English-language systems. This is more than a translation problem; it’s a design challenge: how do you honor culturally specific norms around power dynamics, decision-making, and emotional expression while still delivering effective support? The study’s results imply that the best-performing systems in one language may look quite different in another, and that means a global health tech strategy must embrace localized development and evaluation pipelines.

Roadmaps, risks, and the future of AI-assisted counseling

What would it take to close the gap between AI counseling and human practice? The paper’s discussion points to several concrete paths. First, more realistic client simulations are essential. The authors highlight the need for dynamic persona settings that capture office politics, family context, and realistic emotional arcs. That’s not merely “more realistic talk.” It’s about friction and resilience—the moments when a client challenges a counselor, or a client hesitates before sharing a painful memory. Second, retrieval-augmented generation (RAG) and targeted fine-tuning can help AIs pull in the right experiences, examples, and evidence to deepen reflective moments without veering into over-prescription or banality. Third, multi-agent architectures could let different AI systems contribute different strengths in real time, guided by reasoning models that keep sessions focused, ethical, and human-centered. Fourth, training in MI techniques—like Ask–Offer–Ask—could reduce the risk of premature advice-giving and promote genuine collaboration. These are not small tweaks; they’re a rethinking of how AI tools sit inside a therapist’s practice and a client’s experience of care.
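To make the last of those ideas concrete, Ask-Offer-Ask can be thought of as a small state machine that withholds advice until permission has been sought and the client's reaction has been invited. The sketch below is a generic illustration of that motivational interviewing technique under invented names, not a design taken from the paper.

```python
from enum import Enum, auto

class AOAStage(Enum):
    """Stages of the Ask-Offer-Ask cycle from motivational interviewing."""
    ASK_PERMISSION = auto()  # ask what the client knows, or whether input is welcome
    OFFER = auto()           # offer one small piece of information or an option
    ASK_REACTION = auto()    # ask what the client makes of it

class AskOfferAskGuard:
    """Toy guard that keeps a counselor agent from giving advice out of turn.

    Illustrative only: class, method names, and phrasing are invented for this sketch.
    """

    def __init__(self) -> None:
        self.stage = AOAStage.ASK_PERMISSION

    def next_move(self, drafted_advice: str) -> str:
        """Return what the counselor may actually say at this point in the cycle."""
        if self.stage is AOAStage.ASK_PERMISSION:
            self.stage = AOAStage.OFFER
            return "Would it be alright if I shared one thought? What do you already know about this?"
        if self.stage is AOAStage.OFFER:
            self.stage = AOAStage.ASK_REACTION
            return drafted_advice  # the information itself, kept brief
        self.stage = AOAStage.ASK_PERMISSION
        return "What do you make of that, and how might it fit your situation?"

# Example: three calls walk one full Ask-Offer-Ask cycle.
guard = AskOfferAskGuard()
for _ in range(3):
    print(guard.next_move("One option some people find useful is blocking out a no-meeting hour."))
```

Even this toy version captures the intent the article describes: the system structurally cannot jump straight to advice, because an asking step always comes first and an invitation to react always follows.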

There are also non-technical risks to anticipate. Transparency about AI involvement, privacy protections, and clear boundaries about when a human should step in are essential. The study is careful to note that its experiments used AI-to-AI interactions rather than real patients. Real-world deployment will demand rigorous ethics, clinician oversight, and robust regulatory guardrails. Still, the path forward is exciting. The work demonstrates that AI can be scaffolded to support counseling practice in meaningful ways, not by simulating human beings perfectly but by augmenting human caregivers’ capacity to listen, reflect, and empower clients to move forward.

In the end, the study offers a compact, human-centered insight: AI agents can show up with warmth and competence in a Japanese counseling context, provided they’re guided by careful prompts, tuned for cultural nuance, and evaluated with tools that recognize the depth of a real conversation. The researchers’ verdict is cautiously optimistic: we’re on a track where AI can enhance training and supplement counseling, but the ethical, cultural, and psychological work remains a human journey that machines can support—and occasionally illuminate.

Key takeaway: Structured prompting and careful evaluation make AI counselors more competent, but true depth requires richer emotions, more realistic clients, and culturally aware design. That’s not a punchline about AI taking over therapy; it’s a reminder that human-centered care still needs humans—and AI can help humans do it better, one careful conversation at a time.