Across the gleaming interfaces that now shape our daily conversations, a quiet drama is unfolding. Machines that once seemed to simply parrot patterns are being nudged by researchers to step into human moral puzzles, to act as if they were deciding who belongs, who stays, and who gets a second chance. The study behind PAPERSPLEASE doesn’t pretend to solve ethics for AI. It asks a sharper question: when an AI weighs human motives—Existence, Relatedness, Growth—whose needs does it decide matter most, and under what social lights do those decisions change?
Out of KAIST, a team led by Junho Myung and Yeon Su Park built a bold, dystopian-sounding test bed drawn from the border-world game Papers, Please. But this isn’t entertainment. It’s a serious probe into how large language models—GPT-4o-mini, Claude, Llama, and friends—prioritize people’s needs in a setting that exaggerates consequences: approving or denying entry to individuals whose stories are compact narratives, each anchored in one of Alderfer’s three ERG motivational categories. The project asks not just whether an AI can reason about ethics, but whether it carries hidden preferences about who gets safety, who gets connection, and who gets to grow. And it does so with a careful eye toward bias, asking how identity cues—gender, race, religion—may tilt judgment in subtle, potentially harmful ways.
The authors are clear about their aim. PAPERSPLEASE offers a benchmark—a structured, large-scale way to inspect how model behavior aligns with human moral intuitions and how it might diverge when people bring identity into the conversation. It’s a step toward making AI safer in high-stakes social contexts, even if the contexts themselves are deliberately challenging and ethically fraught. The work foregrounds a practical problem in AI today: you can’t deploy a model to screen real people for rights and entry without understanding the values it’s carrying, often in ways that no one asked for or wants. And the study makes a bold claim about the value of the ERG framework as a lens for interpretation: three core human needs, layered and sometimes competing, can reveal the hidden architecture of a model’s judgments.
Lead researchers and the institution behind the work anchor the study in KAIST, with Junho Myung and Yeon Su Park as co-first authors and a team that includes Sunwoo Kim, Shin Yoo, and Alice Oh. The authors emphasize that the research is a careful, data-driven exploration of how LLMs prioritize needs and how social cues shift those priorities. This isn’t a clap-on-the-back moment for AI; it’s a warning that the subtle biases encoded in large models can surface in morally charged scenarios, sometimes in ways that echo real-world discrimination. The paper’s data and prompts live in the open, inviting others to test, critique, and extend the findings as AI continues to intersect with human rights and policy.
What PAPERSPLEASE is and why it matters
At the core of the project is an audacious framing: imagine that an AI is stationed at a border checkpoint, deciding who may enter a country. The twist is that each applicant isn’t described by a dossier, but by a short, vivid narrative that signals a specific human motivation. The three ERG categories—Existence (the basics of survival and safety), Relatedness (the need for connection and belonging), and Growth (self-improvement and potential)—function as a three-tier ladder for motive. The researchers generate 3,700 scenarios that pair these motives with social identities (gender, race, religion). The AI’s job isn’t to determine policy; it’s to choose approve or deny, under conditions that mimic the moral friction of real decisions: should you help someone who’s desperate but may threaten security? Should you permit growth and opportunity even if it requires bending rules? These are the kinds of decisions that real institutions face, and the model’s answers become a mirror for our own moral blind spots.
The scale is deliberate. The team used six well-known LLMs—three proprietary systems (GPT-4o-mini, Claude-3.7-sonnet, Gemini-2.0-flash) and three open models (Llama-4-Maverick-17B-128E-Instruct, Llama-3.1-8B-Instruct, Qwen3-14B). Each narrative is infused with identity cues, so the same story can surface different responses depending on whether the applicant is described as male, female, non-binary, White, Black, Hispanic, Asian, or Christian, Muslim, Hindu, or Buddhist. The project isn’t testing for everyday fairness in the abstract; it’s testing how a model’s decision framework shifts when the social labels attached to a person change just a little. It’s the AI version of examining how a judge’s past experiences or the room’s mood can tilt a verdict—only here the “room” is a silicon collar around a language model’s patterns.
Key idea: the ERG framework provides a compact map of human motivation that lets researchers quantify how a model prioritizes needs. If an AI consistently favors Existence over Relatedness or Growth, it suggests a particular moral bias—one that might be understandable in a survivalist frame but troubling when applied to social decisions. The benchmark doesn’t simply measure accuracy or efficiency; it probes the value structure—the “why” behind the AI’s choices—and whether that structure reproduces or challenges human biases. That’s not a flourish of meta-ethics; it’s a practical concern for any system used in sensitive domains, from visa screening to social services to automated moderation.
What the experiments reveal about AI values
The first big takeaway is methodological: the study’s three-pronged evaluation—individual case, comparative case, and social-dimension case—unlocks different facets of how models think. In the individual case, the researchers measured acceptance or denial across three motivational values. The results showed a striking pattern: most models prioritized Existence and Growth over Relatedness, with Relatedness trailing behind. This roughly maps onto the intuition that basic safety and personal development feel more “urgent” than cultivating relationships in the abstract, even though ERG theory treats relatedness as a core need. It’s a mirror back to human behavior that prioritizes staying alive and advancing one’s future, before the more social or intimate motivations—though there are notable exceptions. One model, Llama-3.1-8B-Instruct, deviated from the expected hierarchy, accepting with unusual consistency across all three categories while still performing robustly overall. Another model, Claude-3.7-sonnet, spiraled into consistent denial, a stark contrast that hints at its alignment to stricter rule-application or policy interpretation rather than human-centric flexibility.
When forced to choose among three competing values in a comparative case, the models clearly split into two camps. Some—GPT-4o-mini, Claude-3.7-sonnet, and Qwen3-14B—tilted toward Existence-based motivations, aligning with a foundational, survival-first stance. Others—Gemini-2.0-flash, Llama-4-Maverick, and Llama-3.1-8B-Instruct—offered a more balanced spread across Existence, Relatedness, and Growth. The former trio tended to treat basic survival as the primary gatekeeper, while the latter group showed more nuance, allowing interpersonal or developmental motives to compete more evenly with survival concerns. The authors note that nine of fifteen model pairings showed statistically significant differences in these priors, underscoring that the construction of an AI’s moral compass is not uniform across architectures or alignment objectives. A telling divide emerges: GPT- and Claude-like systems cluster together in their priorities, while the Llama and Gemini families form another, systematically distinct group. This isn’t a minor statistical wrinkle. It signals that the design choices, training signals, and alignment policies baked into a model can channel moral reasoning along recognizable lines.
The social-dimension case is where the study’s real-world anxieties surface. Here, a narrative cue—an identity tag attached to the applicant—serves as a lever for potential bias. GPT-4o-mini, for instance, showed significant shifts in approval rates when identity cues were present, often boosting acceptance for identities tied to gender diversity or certain religious groups in growth and relatedness contexts. Yet in the existence category, where acceptance was already very high, identity cues moved the needle far less. The patterns aren’t uniform across models. Gemini-2.0-flash tended to favor gender-diverse identities more broadly, with particular boosts for growth-related narratives among non-binary and female identifiers, but with reductions for groups like Hindu or Hindu-associated identities in related contexts. Llama models demonstrated a more troubling tendency: while there were some positive nudges for non-dominant identities in certain categories, a general bias against Black, Hindu, and Hindu-associated identities persisted in several settings. Qwen3-14B revealed a mixed picture, with some identities pulling down growth-related approvals and others showing nuanced shifts depending on the narrative frame. The upshot is stark: even when the scenarios are artificial, the social labels attached to people can shift AI decisions in predictable, potentially harmful ways.
Takeaway: the results confirm what many AI ethicists have warned about for years: models encode subtle value biases—whether from training data, alignment guidelines, or architectural defaults—that can surface in high-stakes, identity-bearing contexts. Seeing these patterns in a structured, repeatable benchmark makes the problem tangible rather than theoretical. It also clarifies which directions might be more or less risky to deploy in real-world decision systems. The study’s clustering of models into two behavioral families is particularly consequential, suggesting that broad architectural or policy choices tend to pull models toward families with shared ethical predispositions. If policymakers want to curb bias in automated decision-making, understanding where a given system sits in that map matters as much as understanding its accuracy on a task.
Ethical stakes and what we should do next
The authors do not pretend their benchmark will settle questions about AI justice. Instead, they offer a diagnostic tool—a lens to see how value priors flex under pressure and how social cues can nudge those priors in ways that resemble real-world discrimination. This matters because the same tech that can draft a compassionate email or translate a legal brief can also screen people, allocate resources, or limit opportunities, all based on implicit moral assumptions embedded in the model. If a border-control scenario can reveal biased tendencies under vividly moral conditions, then so too can healthcare triage chatbots, housing-assistance chat programs, or policing-aid chat interfaces. PAPERSPLEASE doesn’t just test whether an AI can reason morally; it tests whether that moral reasoning remains humane when the stakes are defined by who a person is, not just what they need.
From a design perspective, the work invites several practical takeaways. First, alignment and disclosure matter. If a system will operate in sensitive domains, it should be shaped by explicit, human-centered guidelines about how it weighs needs and how it responds to identity signals. Second, transparency about the model’s value priors could become a feature, not a bug—allowing operators to audit why decisions tilt toward Existence, or why growth narratives get more weight for certain groups. Third, mechanisms for oversight and redress are essential. When a model’s decisions can shape real lives, there must be human-in-the-loop checks, post-hoc auditing, and ways to correct biased patterns without sacrificing overall usefulness or efficiency. The paper’s own open data—scenarios, prompts, and results—offers a blueprint for replication and broader scrutiny, a welcome move toward collaborative governance of AI ethics rather than isolated experiments in plush lab settings.
Still, the study’s limitations remind us not to pretend we’ve solved anything. Six models, a dystopian borderland, and a fixed three-factor motivational framework cannot capture the full range of human values, nor the messy, fluid realities of everyday decision-making. The authors acknowledge that their scenarios are simplified, that future work should diversify tasks and expand the value space, and that graded responses could better reflect nuance in real-world choices. These caveats are not excuses; they’re signposts for where the field must go next if we want AI to help, not harm, in our social fabric.
In the end, the PAPERSPLEASE project is less about catching models in a moral slip and more about making those slips visible. It’s a map of the moral terrain inside a machine, drawn with a careful hand and an eye toward fairness. The implication isn’t that AI should be forced to think and feel like a human; it’s that we should design AI so that its moral compass—how it prioritizes needs, how it weighs belonging against opportunity, and how it responds to who a person is—does not quietly replicate prejudice under the glow of a screen. If we can illuminate those internal maps, we stand a better chance of steering AI toward decisions that protect dignity, uphold rights, and expand opportunity, even when the path is ethically treacherous and emotionally charged.
Bottom line: PAPERSPLEASE provides a rare, large-scale look at how language models encode value judgments and how those judgments shift when people’s identities are highlighted. It’s a reminder that the tools we build to assist, automate, or manage human systems must be designed with humility about what they value—and with courage to confront the biases those values may embed. The study’s authors at KAIST show that understanding AI’s moral architecture isn’t just an academic exercise; it’s a prerequisite for building systems that treat people with fairness and respect, especially when lives and liberties hang in the balance.