Can AI Learn to Score Human Beauty, and Why Does It Matter?

When you scroll through a feed and pause at a portrait that feels just right, you’re tapping into a blend of lighting, pose, expression, and context that human eyes perceive in a fraction of a second. Now a team of researchers from Tsinghua University and Kuaishou Technology wants to teach a computer to do something similar—judge not just whether an image is pretty, but how the many layers of a human-centric photo add up to a single, nuanced aesthetic score. They’ve built a large, carefully annotated dataset and a multi-headed model that blends vision and language, aiming to quantify beauty in a way that’s both precise and interpretable. It’s a bold step toward a future in which AI can help curate, generate, and critique human-centered imagery with a sophistication that feels almost human.

The project, led by Zhichao Liao and colleagues, is a collaboration between Tsinghua University and Kuaishou Technology. It centers on a simple but ambitious idea: aesthetics aren’t a single dial you can twist to a number; they’re a tapestry woven from multiple attributes—facial brightness, body shape, clothing, the surrounding environment, and more. If you want an AI to assess aesthetics in a way that resonates with human judgments, you may need to measure each thread in that tapestry and also learn how those threads relate to one another. That insight is the core of HumanBeauty, the first large-scale dataset crafted specifically for Human Image Aesthetic Assessment, and of HumanAesExpert, a foundation-model approach that uses an “Expert head” plus language and regression components to get both granular and global readings of aesthetics.

HumanBeauty: a dataset built for a human-centered taste test

The field of image aesthetics has long wrestled with datasets that either pool everything into one score or skim the surface of human perception. This new work starts with a confession and a challenge: existing image aesthetics data often underrepresent the human subject in a holistic sense—too many images spotlight faces or bodies in isolation, or rely on a coarse, one-dimensional score. So the team built HumanBeauty, a dataset of 108,586 annotated human images designed to support HIAA, or Human Image Aesthetic Assessment. It’s a two-pronged effort. First, they curated 58,564 human-centered images from public aesthetic datasets, aligning their scores onto a common scale. Then they added 50,022 images they collected and annotated explicitly for twelve-dimensional aesthetic analysis, yielding a richly labeled resource that’s unusually comprehensive for this domain.

The twelve dimensions aren’t an afterthought. They’re a scaffold for a more stable, interpretable evaluation of human-centric imagery. The categories break down into three clusters: facial aesthetics (brightness, feature clarity, skin tone, structure, contour clarity), general appearance (outfit, body shape, looks), and environment (the setting around the subject). Each image in the 50K subset is annotated by a panel of nine raters, and every dimension is scored on a 0-to-1 scale, producing a mean opinion score, MOS, that’s directly comparable across images. All told, HumanBeauty represents a deliberate attempt to capture the facets that people notice when they assess a portrait or a candid shot, with an eye toward how those facets combine to form an overall impression.
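
To make the annotation scheme concrete, here’s a minimal sketch of how a mean opinion score can be computed from such a panel. The array shape, the random values, and the simple average used for the overall impression are assumptions for illustration; the paper’s exact aggregation may differ.

```python
import numpy as np

# Hypothetical ratings matrix: 9 raters x 12 sub-dimensions, scores in [0, 1].
# The shape and the random values are illustrative, not drawn from the dataset.
rng = np.random.default_rng(0)
ratings = rng.uniform(0.0, 1.0, size=(9, 12))

# Mean opinion score (MOS) per sub-dimension: average across the nine raters.
dimension_mos = ratings.mean(axis=0)   # shape (12,)

# One simple way to get an overall impression is to average the sub-dimension
# MOS values; how the authors actually aggregate them may differ.
overall_mos = float(dimension_mos.mean())

print(np.round(dimension_mos, 3), round(overall_mos, 3))
```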

The data work wasn’t a rushed, one-off effort. The team applied a careful, iterative protocol to ensure quality and reliability. They used face detection to ensure images included the subject’s face and body, filtered out content that would be inappropriate for wide audiences, and normalized scores from different source datasets onto a single scale. The result is a resource that feels both scientifically robust and practically useful for downstream tasks in content curation and AI-assisted design.
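
Normalizing scores that arrive on different native scales is the kind of step that’s easy to describe and easy to get subtly wrong. Here’s a minimal sketch of one common approach, plain min-max rescaling onto a 0-to-1 range; the function name and the example source scales are hypothetical, and the authors’ actual alignment procedure may be more involved.

```python
import numpy as np

def rescale_to_unit_interval(scores, source_min, source_max):
    """Generic min-max rescaling of a source dataset's native score range
    onto [0, 1]. The paper's actual alignment procedure may be more involved."""
    scores = np.asarray(scores, dtype=float)
    return (scores - source_min) / (source_max - source_min)

# Hypothetical source scales: one dataset rated 1-10, another 1-5.
print(rescale_to_unit_interval([3.2, 7.8, 9.1], 1, 10))  # -> [0.244 0.756 0.9  ]
print(rescale_to_unit_interval([2.0, 4.5], 1, 5))        # -> [0.25  0.875]
```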

In the background, the HumanBeauty project also tackles a broader problem in vision-language research: when you stitch together a large language model and an image encoder, you need training signals that capture both high-level judgments and fine-grained sensory cues. The twelve-dimension annotation is the kind of structured, human-friendly signal that makes those cross-modal models learn to reason about aesthetics in a way that’s closer to how we talk about beauty in ordinary life.

A three-headed brain for aesthetics: the HumanAesExpert model

If you’re used to thinking of AI aesthetics as a single score, the HumanAesExpert model will feel revelatory. The researchers argue that a single regression head or a purely text-based scoring head falls short when you want both precise numbers and human-like interpretability. Their solution is a three-headed architecture that works in concert rather than in competition.

First, there’s the LM head, the familiar workhorse of many multimodal systems. It translates the image’s features into textual prompts and rating labels, preserving the language-understanding strengths of large models. This is important because people think and talk about aesthetics with nuance, not just a single numeric value. Second, a regression head operates directly on the learned representations to produce a continuous overall score. This head benefits from the precision of regression training but can miss the subtleties that language can carry. Third, and most novel, is the Expert head. It’s a sparsely connected multilayer perceptron designed to mirror the relationships among the twelve aesthetic sub-dimensions. By training this head to predict the smaller, more granular scores and then feeding those into a broader synthesis, the model gains a structured, human-friendly map of what drives the overall rating.
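
The LM head is essentially the language model’s own text interface, so it doesn’t reduce to a few lines, but the other two heads are easier to picture. Below is a minimal PyTorch sketch of what a regression head and a cluster-structured Expert head could look like on top of a pooled multimodal feature vector. The layer sizes, the sigmoid outputs, and the six/three/three split of the twelve sub-dimensions into face, appearance, and environment clusters are assumptions for illustration, not the paper’s published architecture.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Maps a pooled multimodal feature vector to one overall score in [0, 1]."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                                  nn.Linear(dim // 2, 1), nn.Sigmoid())

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features).squeeze(-1)

class ExpertHead(nn.Module):
    """Sketch of a sparsely connected expert head: twelve sub-dimension scores
    are pooled within assumed clusters (face / appearance / environment) before
    being fused into an overall estimate. The cluster split is illustrative."""
    def __init__(self, dim: int):
        super().__init__()
        self.sub_scores = nn.Sequential(nn.Linear(dim, 12), nn.Sigmoid())
        mask = torch.zeros(3, 12)
        mask[0, :6] = 1.0   # facial sub-dimensions (assumed indices)
        mask[1, 6:9] = 1.0  # general-appearance sub-dimensions
        mask[2, 9:] = 1.0   # environment sub-dimensions
        self.register_buffer("mask", mask)
        self.cluster_weight = nn.Parameter(torch.randn(3, 12) * 0.01)
        self.fuse = nn.Linear(3, 1)

    def forward(self, features: torch.Tensor):
        subs = self.sub_scores(features)                          # (batch, 12)
        clusters = subs @ (self.cluster_weight * self.mask).t()   # (batch, 3)
        overall = torch.sigmoid(self.fuse(clusters)).squeeze(-1)
        return subs, overall

# Dummy usage on a batch of pooled 768-dimensional features.
feats = torch.randn(4, 768)
reg_score = RegressionHead(768)(feats)             # shape (4,)
sub_scores, expert_score = ExpertHead(768)(feats)  # shapes (4, 12) and (4,)
```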

The trio doesn’t work in isolation. The authors combine the three streams with a MetaVoter—a learnable system that blends the LM, Regression, and Expert heads into a final score. Think of it as a referee that weighs each head’s input according to what the situation demands: language-based reasoning for explainability, the regression path for numeric precision, and the Expert path for nuance grounded in the hierarchy of the 12 dimensions. The result is not another single-number predictor but a more robust, interpretable ensemble that can adapt to different kinds of images and evaluation signals.
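
In code, the fusion idea is compact enough to sketch. The module below is a hypothetical stand-in for the MetaVoter: it takes the three heads’ scalar scores and learns a nonlinear blend of them. The hidden size and the sigmoid at the end are assumptions; the published MetaVoter may consume richer inputs than three scalars.

```python
import torch
import torch.nn as nn

class MetaVoter(nn.Module):
    """Toy fusion module: learns to blend the LM, regression, and Expert
    scores into a single final prediction. The real MetaVoter's inputs and
    internals may differ; this only conveys the general idea."""
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, lm_score, reg_score, expert_score):
        stacked = torch.stack([lm_score, reg_score, expert_score], dim=-1)
        return torch.sigmoid(self.mlp(stacked)).squeeze(-1)

# Dummy per-image scores in [0, 1] from the three heads.
voter = MetaVoter()
final_score = voter(torch.tensor([0.62]), torch.tensor([0.58]),
                    torch.tensor([0.71]))
```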

Two-stage training anchors the approach. In stage one, the model is fine-tuned with overall annotations, letting the LM head and the regression and Expert branches learn from the broad aesthetic signal. In stage two, the MetaVoter learns to balance the three heads, using scores from both the overall and the 12-dimensional annotations. The architecture is designed not just for accuracy but for interpretability: the Expert head produces explicit sub-dimension scores, while the MetaVoter produces a final prediction that reflects the model’s best blend of all three ways of knowing aesthetics.
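
Here is a hedged sketch of what that two-stage schedule might look like in PyTorch, with toy stand-ins for the backbone, the heads, and the MetaVoter; the learning rates, module shapes, and elided loops are placeholders rather than the paper’s actual training recipe.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-ins for the real components: in practice these would be the multimodal
# backbone, the three heads, and the MetaVoter sketched above.
backbone = nn.Linear(768, 768)
heads = nn.ModuleDict({"regression": nn.Linear(768, 1),
                       "expert": nn.Linear(768, 12)})
meta_voter = nn.Linear(3, 1)

# Stage one: fine-tune the backbone and heads on the overall aesthetic labels.
stage1_params = list(backbone.parameters()) + list(heads.parameters())
stage1_optimizer = optim.AdamW(stage1_params, lr=1e-5)
# ... loop over (image, overall_mos) pairs and update stage1_params ...

# Stage two: freeze everything else and train only the MetaVoter, using both
# the overall scores and the twelve-dimensional annotations as supervision.
for p in stage1_params:
    p.requires_grad_(False)
stage2_optimizer = optim.AdamW(meta_voter.parameters(), lr=1e-4)
# ... loop over head outputs and target scores and update meta_voter ...
```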

The researchers report that this approach yields state-of-the-art performance on their dataset across standard metrics like MSE, MAE, and various correlation measures. In the experiments, the 8B version of the model consistently outperformed competing methods, and even the 1B variant showed meaningful gains. More than raw numbers, the result points to a practical takeaway: combining language-informed reasoning with dimension-aware expertise and a learned fusion strategy can produce more human-aligned judgments in a field as subjective as aesthetics.
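
For readers who want to reproduce the bookkeeping, the standard metrics are straightforward to compute. The sketch below uses NumPy and SciPy and assumes the correlation measures are Spearman’s and Pearson’s coefficients, the usual choices in aesthetic assessment; the dummy numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

def aesthetic_metrics(pred, target):
    """MSE, MAE, and the rank/linear correlations (SRCC, PLCC) commonly
    reported for aesthetic assessment benchmarks."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return {
        "mse": float(np.mean((pred - target) ** 2)),
        "mae": float(np.mean(np.abs(pred - target))),
        "srcc": float(stats.spearmanr(pred, target)[0]),
        "plcc": float(stats.pearsonr(pred, target)[0]),
    }

# Dummy predictions versus ground-truth MOS values, for illustration only.
print(aesthetic_metrics([0.61, 0.42, 0.88, 0.35], [0.58, 0.47, 0.91, 0.30]))
```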

Why this matters: what a machine’s taste could mean for culture and computation

The promise of a foundation model for aesthetic perception isn’t just about giving computers a taste for portraits. It’s about creating scalable, auditable, and tunable judgments that can operate in real-world systems where aesthetics matters—everything from feed ranking and ad design to AI-assisted photography and generative imagery. If a platform wants to surface content that fits a user’s nuanced preferences, a calibrated, interpretable aesthetic model could offer more than a blunt popularity signal. It could, in theory, weigh how much a user cares about facial expression versus environment, or how much the outfit contributes to an overall “looks” score, and do so in a way that’s explainable to humans.

But there’s a social dimension to this shift. Beauty standards are not universal; they’re shaped by culture, context, and personal experience. The HumanBeauty dataset includes images from diverse sources and emphasizes several dimensions that often surface in portrait culture, yet the risk remains that a model trained on a particular demographic or aesthetic ideal will reproduce or amplify biases. The authors are careful to release their data, models, and code openly, which is a double-edged sword: it accelerates progress and invites scrutiny, but it also raises the question of how such a tool could be misused—whether to steer people toward certain looks, to police appearance in sensitive contexts, or to entrench narrow beauty norms in AI-driven content curation.

From a content-creation perspective, the work could become a powerful ally for designers and platforms. It could help editors decide which images best fit a brand’s tone, or guide artists and photographers in composing visuals with a measurable aesthetic vocabulary. For creators in AI-assisted workflows, a model that can explain its judgments dimension by dimension offers a kind of semantic transparency that blunt, single-number metrics cannot. It’s not about replacing human taste but about augmenting it with a machine that understands the texture of taste—how brightness interacts with contour, how environment whispers to looks, how outfit and body shape contribute to an overall impression.

The 12 dimensions as a window into how we see beauty

One of the most striking features of this work is not just the scores but the very structure of the scoring system. By decomposing aesthetics into twelve sub-dimensions, the researchers showcase how a portrait’s appeal can hinge on a handful of interacting factors: facial brightness modulates mood; feature clarity anchors recognition; skin tone can convey health and vitality; contour and structure shape perceived beauty; the outfit and body shape contribute to the style and narrative; and the environment gives context that can elevate or dull the whole composition. This decomposition mirrors how human judges often reason about images: we don’t give a single monolithic verdict; we tell a story about which elements pulled us in and which ones kept us at a distance.

The qualitative demonstrations in the study hint at a kind of emergent interpretability. In example images, the model’s 12-dimension outputs line up with human judgments in telling ways, sometimes revealing why a particular image scores poorly in facial brightness or body contour clarity. That interpretability—an explicit bridge from pixels to dimension-level explanations—offers a path toward AI that can be critiqued, refined, or tuned in ways that are rare for black-box predictors. If a platform wanted to improve a portrait’s aesthetic alignment, it might use such signals not just to rank images but to guide edits that address specific dimension gaps, much like a photographer iterates on lighting to illuminate a subject’s features more clearly.

Limitations, cautions, and the road ahead

As with any attempt to quantify taste, there are boundaries to what this work can claim. Beauty is culturally loaded, deeply personal, and frequently situational. The six-figure dataset and the twelve-dimensional standard are a robust start, but they can’t magically capture every flavor of aesthetics across all communities, professions, and artistic traditions. The authors acknowledge that some metrics still lag behind the human baseline, and that the dataset, despite its scale, is a living foundation that will need ongoing validation and expansion to remain fair and representative.

There are also technical considerations. The cross-domain pairing of data with different score scales is tricky, and even with the 12-dimension scaffold, the model must avoid overfitting to the specific distribution of images in HumanBeauty. The zero-shot evaluations offer encouraging signs of generalization, but real-world deployment will demand ongoing monitoring for biases and drift as aesthetics evolve with fashion, technology, and social norms.

Still, the work represents a meaningful convergence of vision, language, and structured human knowledge. It’s a reminder that the best AI tools in subjective domains may not be those that pretend to replace human judgment but those that learn to reason with human-like nuance—how we describe beauty to one another, how we argue about it, and how those conversations can be embedded into machines that help us see more clearly what we’re looking for in an image.

From dataset to deployment: what we learn about building thoughtful AI

Beyond the specifics of aesthetics, the HumanBeauty/HumanAesExpert project offers a blueprint for building responsible, adaptable AI systems in highly subjective domains. The combination of a richly labeled dataset, a multi-headed architecture that preserves language understanding while offering interpretable, dimension-level insights, and a learned fusion mechanism to balance different reasoning streams is a design pattern that could apply to other nuanced tasks—creative evaluation, artistic style analysis, or even moral and ethical reasoning in AI contexts.

The team behind the study—principally researchers from Tsinghua University and Kuaishou Technology—frames this work as a foundation. They’ve released datasets, models, and code to invite broader participation and critique, a stance that helps guard against the isolation of a single lab’s perspective. In a field where the next breakthrough can be a completely different architecture, this emphasis on shared resources creates a community scaffold for iterative improvement and democratic scrutiny.

As we move forward, the question isn’t merely whether machines can learn to rank beauty, but how they can do so in ways that honor diverse human sensibilities while remaining transparent, fair, and useful. The HumanBeauty project is a provocative and carefully executed step in that direction—a demonstration that when we measure something as personal as aesthetics with care, we can build AI tools that are less about policing taste and more about helping people understand it more deeply.

The study cites the university and research ecosystem behind the work—Tsinghua University and Kuaishou Technology—and names the principal investigators and contributors, including Zhichao Liao of Tsinghua University and colleagues Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, and Pingfa Feng. It’s a collaboration that blends academic rigor with real-world applicability, a pairing that’s likely to accelerate both scientific understanding and practical tools for media, design, and content platforms in the years ahead.

In the end, what this work leaves us with is a provocative, human-centric lens on AI’s taste-making abilities. It isn’t about a perfect impersonation of human judgment, but about a computational partner that can break beauty down into meaningful, explainable parts, learn from them, and, crucially, tell a story about why a particular image resonates. If that kind of storytelling capability scales, it could reshape not just how AI sees pictures, but how we think about the pictures we create, curate, and share with the world.