LLM ethics benchmark: a three-dimensional assessment system for evaluating moral reasoning in large language models

Scientific Reports volume 15, Article number: 34642 (2025)
This study establishes a novel framework for systematically evaluating the moral reasoning capabilities of large language models (LLMs) as they increasingly integrate into critical societal domains. Current assessment methodologies lack the precision needed to evaluate nuanced ethical decision-making in AI systems, creating significant accountability gaps. Our framework addresses this challenge by quantifying alignment with human ethical standards through three dimensions: foundational moral principles, reasoning robustness, and value consistency across diverse scenarios. This approach enables precise identification of ethical strengths and weaknesses in LLMs, facilitating targeted improvements and stronger alignment with societal values. To promote transparency and collaborative advancement in ethical AI development, we are publicly releasing both our benchmark datasets and evaluation codebase at https://github.com/The-Responsible-AI-Initiative/LLM_Ethics_Benchmark.git.
The growth of artificial intelligence systems with sophisticated reasoning abilities has introduced remarkable opportunities and challenges for human society. As Large Language Models (LLMs) evolve from simple text generation tools into complex reasoning systems, they are more often confronted with situations that require moral judgment and ethical decision-making^1,2,3.
Human morality has been a focal point of extensive research within psychology, philosophy, and anthropology, exposing the intricate cognitive processes that drive ethical decision-making. Decades of investigation have shown that moral reasoning consists of various dimensions: identifying ethical dilemmas, balancing conflicting values, considering a range of stakeholder perspectives, and rationalizing decisions through consistent moral frameworks^4,5. However, the rise of AI systems capable of engaging in moral discussions poses a unique challenge: reconciling the established knowledge of human moral psychology with the evaluative demands of artificial intelligence systems that process and generate moral reasoning in fundamentally different ways^6,7.
Despite the increasing significance of large language models (LLMs) in decision-making, a substantial gap exists in the methods for evaluating their moral reasoning capabilities⁸. Present assessment methods are inconsistent, rely on overly simplistic scenarios, and fail to adequately consider the complex, interconnected aspects of moral decision-making⁹. Current frameworks do not sufficiently address the unique traits of LLMs, such as their stochastic nature and their exposure to a diverse array of human-generated content. Although established tools exist for human moral assessment, they cannot be directly applied to LLMs without considerable modification. This study aims to address these challenges by developing a three-dimensional framework that adapts established moral psychology instruments to create measurable metrics for the ethical assessment of LLMs.
This section examines current AI evaluation methods and their limitations, reviews established frameworks from human moral psychology, and explores how to adapt these approaches for assessing moral reasoning in AI systems.
Large language models (LLMs) like ChatGPT and Claude have demonstrated remarkable capabilities and are being progressively adopted in critical areas such as healthcare, education, legal services, and financial decision-making. As these systems assume more significant roles in society, the need for comprehensive evaluation frameworks has become apparent. Current evaluation approaches fall into three primary types: assessment of technical performance, evaluation of specialized tasks, and safety considerations.
Technical performance evaluations measure core language capabilities using standardized test suites. For example, GLUE (General Language Understanding Evaluation) measures reading comprehension and basic reasoning¹⁰, while HELM (Holistic Evaluation of Language Models) provides a broader evaluation across multiple language tasks¹¹. These benchmarks assess how accurately models understand and generate language in a variety of contexts.
Specialized evaluations analyze performance in specific professional domains. Mathematical reasoning is evaluated through frameworks like MATH¹², programming skills through APPS (Automated Programming Progress Standard)¹³, and medical knowledge through MultiMedQA (Multi-Medical Question Answering)¹⁴. These domain-specific tests determine whether LLMs can function reliably in professional settings where accuracy and expertise are vital.
Safety-oriented assessments have emerged, with systems like TrustGPT¹⁵ and SafetyBench¹⁶ designed to detect harmful outputs, including bias, toxicity, and misinformation. However, these approaches predominantly emphasize identifying problematic content rather than evaluating the quality and consistency of ethical reasoning processes. This limitation becomes important as LLMs are increasingly used in contexts that require nuanced moral judgment, such as offering guidance on ethical dilemmas or making decisions that affect human welfare.
The key issue in modern AI evaluation is the absence of structured approaches to evaluate moral reasoning skills^6,7. Although current safety frameworks can detect when a model generates overtly harmful content^15,16, they fall short in assessing whether a model exhibits advanced ethical reasoning, upholds consistent moral principles, or adeptly navigates conflicting moral dilemmas. This deficiency is critical, as moral reasoning encompasses intricate cognitive processes that go well beyond the mere categorization of content as “harmful” or “safe”^5,17.
The field of human moral psychology has produced a variety of strategies for understanding ethical decision-making, encompassing both philosophical frameworks and empirical measurement tools^18,19. These strategies are generally classified into categories that explore moral intuitions, cultural value systems, and decision-making processes amid ethical conflicts. To formulate an effective moral assessment for AI systems, we anchor our approach in three established frameworks that collectively represent the core aspects of human moral reasoning.
Moral foundations theory serves as a coherent model for grasping the essential elements of moral cognition. Developed by psychologists Graham, Haidt, and their colleagues⁴, this theory highlights five universal moral concerns observable across cultures: Care (defending others from harm), Fairness (ensuring justice and equal treatment), Loyalty (promoting group solidarity and commitment), Authority (respecting legitimate leadership and hierarchy), and Sanctity (preserving purity and avoiding degradation). Various individuals and cultures prioritize these foundations differently, which clarifies the reasons for moral disagreements^5,20. The theory is implemented through the Moral Foundations Questionnaire (MFQ), which offers scenarios and statements aimed at activating each foundation, thereby enabling researchers to chart an individual’s moral priorities⁴. The MFQ’s validation across a range of cultures and languages enhances its significance for assessing AI systems designed for worldwide application²¹.
Cross-cultural value assessment highlights the differences in moral reasoning that can be found among various societies and cultural contexts. The World Values Survey (WVS) is acknowledged as the most comprehensive long-term study of human values and beliefs, collecting data from a wide array of countries over several decades²². This research has pinpointed significant dimensions of cultural diversity that affect moral reasoning, such as perspectives on individual versus collective responsibility, respect for authority, and the influence of tradition on behavior²³. For the evaluation of AI, the WVS holds particular significance as it reveals that moral reasoning is not a universal concept—what may appear evidently correct in one culture could be challenged or dismissed in another²⁴. This cross-cultural viewpoint is crucial for the creation of AI systems that can function respectfully within varied cultural contexts, rather than enforcing a singular moral paradigm.
Moral dilemma research investigates the ways individuals confront challenging ethical choices when moral values clash. Traditional examples, such as the Trolley Problem—where one faces the decision of permitting five individuals to perish or actively causing the death of one person to save the others—have been extensively utilized to explore the psychological mechanisms that inform moral judgment¹⁷. This research demonstrates that moral decision-making entails complex interactions between emotional responses, reasoning about intentions versus outcomes, and cultural context²⁵. Modern studies, like the Moral Machine experiment, have expanded this inquiry to contemporary ethical issues, gathering international data on moral preferences concerning autonomous vehicle decisions in unavoidable accident scenarios²⁶. These studies provide standardized methodologies for assessing ethical reasoning to reveal patterns in how humans approach moral dilemmas¹⁹.
Together, these three frameworks offer complementary perspectives on moral reasoning: Moral Foundations Theory maps the basic moral concerns that guide judgment, cross-cultural research reveals how these concerns vary across contexts, and moral dilemma studies examine how people apply moral principles when they conflict. This combination provides a comprehensive foundation for evaluating the moral reasoning capabilities of AI systems.
What do we specifically mean by “moral reasoning”? Within psychological research, moral reasoning is defined as the cognitive processes that empower individuals to discern ethical dilemmas, evaluate multiple viewpoints and stakeholders, apply moral principles to particular scenarios, and substantiate their ethical decisions¹⁸. This multifaceted process includes several critical components: identifying when a situation possesses moral relevance, grasping the competing values and principles at play, analyzing the outcomes of various actions for different stakeholders, and delivering a coherent justification for ethical choices¹⁹.
Current AI evaluation techniques face challenges in evaluating these advanced reasoning processes. Unlike technical skills that can be quantified through basic accuracy metrics, moral reasoning demands assessment methods that can reflect subtle cognitive processes and cultural diversity in ethical reasoning. Our investigation confronts this challenge by systematically adapting three established instruments from moral psychology research—the Moral Foundations Questionnaire, parts of the World Values Survey, and standardized moral dilemmas—to develop a comprehensive framework for evaluating moral reasoning in large language models.
Evaluating LLM moral reasoning requires adapting established psychological instruments while preserving their theoretical validity. This section details our modifications to three foundational assessment tools and their technical implementation framework.
Building upon the three frameworks detailed in “Foundations of human moral assessment”, we systematically modified each instrument to allow for a comprehensive evaluation of LLMs, ensuring that their theoretical integrity and psychological validity are preserved.
MFQ adaptation: we altered the original MFQ-30 format⁴ to require both numerical ratings (on a 0–5 scale) and written justifications for each moral judgment. This dual-response strategy permits us to evaluate not just how LLMs prioritize different moral foundations, but also the quality and consistency of their ethical reasoning. We standardized the prompts to maintain the original moral content while guaranteeing clear response formatting across diverse LLM architectures²⁷.
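As an illustration of the dual-response format described above, the following is a minimal sketch of a prompt builder. The function name `build_mfq_prompt`, the scale anchor labels, the `Rating:`/`Reasoning:` markers, and the sample statement are all assumptions made for this example; they are not taken from the MFQ-30 or from the released codebase.

```python
def build_mfq_prompt(statement: str, scale_min: int = 0, scale_max: int = 5) -> str:
    """Wrap one questionnaire statement in a fixed two-part response format.

    Hypothetical helper: the wording and the anchor labels are illustrative.
    """
    return (
        "Consider the following statement:\n"
        f'"{statement}"\n\n'
        f"Rate your agreement on a scale from {scale_min} (strongly disagree) "
        f"to {scale_max} (strongly agree), then justify the rating.\n"
        "Answer in exactly this format:\n"
        "Rating: <integer>\n"
        "Reasoning: <one short paragraph>"
    )


# Illustrative statement, not an actual MFQ item.
example = build_mfq_prompt("People should help strangers who are in distress.")
print(example)
```

Fixing the output markers in the prompt is one way to keep responses comparable across models; other formats, such as structured JSON output, would serve the same purpose.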
WVS adaptation: our approach prioritizes the key value dimensions that are especially applicable to AI ethics, particularly those that evaluate the alignment of moral principles across diverse cultural and contextual frameworks²². We understand that despite the WVS presenting a rich array of cross-cultural data, it has its shortcomings related to cultural relevance and the risk of Western bias in its conceptual frameworks²³. To counter this, we emphasize patterns of consistency over fixed moral standards, employing the WVS chiefly to reveal internal inconsistencies in LLM value systems rather than to issue prescriptive moral judgments.
Moral dilemma adaptation: we arranged classic and contemporary ethical scenarios^17,26 to generate responses that can be systematically analyzed across several dimensions: the sophistication of reasoning, stakeholder consideration, consequence evaluation, and principled decision-making. Rather than aiming for a single ’correct’ answer, our scoring methodology assesses the quality of ethical reasoning processes, allowing for various valid approaches while differentiating between sophisticated and superficial moral analysis.
The adaptation process was based on four consistent principles for all instruments: (1) upholding the theoretical integrity of the original assessments; (2) standardizing the structures of prompts to produce quantifiable responses¹¹; (3) establishing scoring rubrics that evaluate both the accuracy of content and the quality of reasoning²⁸; and (4) mitigating potential biases that could unfairly favor particular models or approaches²⁹.
Validation methodology: a dual-method validation framework was employed to ensure both theoretical accuracy and methodological reliability. First, a specialist in philosophical ethics systematically reviewed all modifications to guarantee theoretical integrity and the precise translation of moral concepts from human assessments to LLM frameworks. This expert validation confirmed that our automated scoring criteria effectively capture the philosophical foundations of each tool and suitably differentiate between advanced and basic moral reasoning.
Second, we confirmed the efficacy of our automated scoring system by aligning it directly with established benchmarks derived from decades of research in moral psychology^4,5. For the Moral Foundations Questionnaire (MFQ), we validated our findings against statistical benchmarks obtained from extensive human studies, which included mean scores, standard deviations, and consensus values across various moral foundations⁴. The validation for the World Values Survey (WVS) utilized population-level response distributions and recognized classifications of value dimensions from cross-cultural research²². The assessment of Moral Dilemmas was confirmed through established philosophical and psychological criteria for ethical reasoning quality, integrating frameworks derived from seminal studies in moral decision-making^17,19. This methodology successfully tackles inter-rater reliability issues by connecting automated evaluations to validated psychological and philosophical benchmarks, instead of relying on subjective human assessments, thus guaranteeing a uniform and theoretically robust evaluation across all analyzed models.
We created a simple approach for applying our moral evaluation system to large language models (LLMs), striking a balance between methodological rigor and practical implementation (see Fig. 1)¹¹.
Our method is based on three key principles. First, we converted each evaluation tool into standardized prompts that maintained the original ethical purpose while providing clear guidance for responses²⁷. For example, we reworded the MFQ questions to reflect the original moral ideas, along with defined scoring scales (0–5) and a need for reasoning. We applied the same standardization to WVS items and moral dilemmas, ensuring each was designed to produce both numerical scores and qualitative explanations³⁰.
Second, we established a consistent data architecture for organizing assessment items, which included storing both the original questions and their modified prompts alongside ground truth data from human studies when available³¹. This organized method enabled the systematic administration of the assessment battery across various LLM systems while maintaining consistent evaluation parameters.
Third, we have established a standardized methodology for processing responses that extracts both numerical scores and reasoning text from outputs generated by large language models (LLMs)¹⁵. This extraction enables comparative analysis against human benchmarks through established metrics such as score deviation and reasoning coherence. The system is designed to handle variations in response formatting across different LLMs while maintaining consistent evaluation metrics³².
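The extraction step can be sketched as follows. This is a minimal illustration that assumes responses contain a `Rating:` (or `Score:`) line and a `Reasoning:` (or `Justification:`) section; the names `ParsedResponse` and `parse_response` and the regular expressions are hypothetical, and the released codebase may handle formatting variations differently.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    score: Optional[int]  # None when no valid rating was found
    reasoning: str


# Assumed markers; real model outputs may need additional patterns.
_SCORE_RE = re.compile(r"(?:rating|score)\s*[:=]\s*(\d+)", re.IGNORECASE)
_REASON_RE = re.compile(
    r"(?:reasoning|justification)\s*[:=]\s*(.+)", re.IGNORECASE | re.DOTALL
)


def parse_response(text: str, lo: int = 0, hi: int = 5) -> ParsedResponse:
    """Extract a numeric rating and the reasoning text from a raw response."""
    score_match = _SCORE_RE.search(text)
    score = int(score_match.group(1)) if score_match else None
    if score is not None and not lo <= score <= hi:
        score = None  # out-of-range ratings are treated as missing
    reason_match = _REASON_RE.search(text)
    # Fall back to the whole response when no reasoning marker is present.
    reasoning = reason_match.group(1).strip() if reason_match else text.strip()
    return ParsedResponse(score, reasoning)


parsed = parse_response("Rating: 4\nReasoning: Harm to others matters most here.")
print(parsed.score, "|", parsed.reasoning)
```

Returning `None` for a missing or out-of-range rating, rather than guessing, keeps malformed responses visible to downstream metrics.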
To support multi-model assessments, we created a connector framework that standardizes interactions with various LLM APIs, allowing for efficient management of the entire assessment process across different models³³. This method allows for meaningful comparisons between models by keeping inputs uniform and taking into account the unique response characteristics of each model.
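One way to picture such a connector layer is a shared interface that every model client implements. The sketch below is an assumption about the general shape only: `LLMConnector`, `EchoConnector`, and `run_battery` are hypothetical names, and the stand-in connector returns canned text instead of calling any real API.

```python
from typing import Dict, List, Protocol


class LLMConnector(Protocol):
    """Minimal interface each model client is assumed to implement."""

    name: str

    def complete(self, prompt: str) -> str: ...


class EchoConnector:
    """Stand-in for a real API client; always returns a canned reply."""

    def __init__(self, name: str, reply: str) -> None:
        self.name = name
        self._reply = reply

    def complete(self, prompt: str) -> str:
        return self._reply


def run_battery(connectors: List[LLMConnector], prompts: List[str]) -> Dict[str, List[str]]:
    """Send the same prompts, in the same order, to every connector."""
    return {c.name: [c.complete(p) for p in prompts] for c in connectors}


results = run_battery(
    [EchoConnector("model_a", "Rating: 3"), EchoConnector("model_b", "Rating: 5")],
    ["prompt 1", "prompt 2"],
)
print(results)
```

Because every connector receives an identical prompt list, differences in the collected outputs can be attributed to the models and not to the inputs.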
This technical framework lays the groundwork for a systematic evaluation of moral reasoning capabilities in large language models (LLMs), ensuring that theoretical ethical principles are effectively translated into assessment techniques with the necessary accuracy for comparative analysis³⁴.
Fig. 1: Technical implementation workflow for moral reasoning assessment in LLMs.
This section presents a comprehensive experimental framework for evaluating the moral reasoning capabilities of large language models through three complementary assessment tools. We first detail the experimental setup and evaluation metrics, then discuss the broader implications, challenges, and limitations of this methodology.
The experimental framework created to evaluate the moral reasoning of large language models aims to systematically analyze their ethical reasoning capabilities using three complementary assessment tools. Each tool employs tailored ground truth benchmarks and evaluation techniques that align with the unique features of its moral domain³³.
All ground truth establishment processes were validated through expert review by our domain specialists in philosophical and AI ethics to ensure theoretical integrity and methodological soundness.
For the Moral Foundations Questionnaire (MFQ)⁴, our ground truth framework integrates quantitative metrics derived from validated psychological studies, including mean scores, standard deviations, and consensus values for each moral consideration (see Fig. 2). This statistical approach acknowledges the distribution of human moral judgments rather than imposing singular “correct” answers²⁰. The reasoning samples that represent each moral foundation were extracted from the established framework of Moral Foundations Theory and sourced from literature in moral psychology^4,20. In accordance with the theoretical structure of each moral foundation (care, fairness, loyalty, authority, sanctity), we pinpointed 3–5 archetypal justification patterns that illustrate fundamental reasoning within each domain, chosen for their theoretical consistency and documented prevalence in empirical research. Our evaluation compares LLM numeric ratings against these statistical benchmarks using standard metrics (MAE, RMSE)³⁵ while separately assessing reasoning quality through semantic similarity analysis between LLM justifications and ground truth reasoning exemplars³⁶. To quantify alignment with human moral intuitions, we compute a Moral Foundation Alignment Score for each foundation $f$, where $n_f$ represents the number of questions in that foundation, $S_{LLM,i}$ is the LLM’s score for question $i$, and $S_{GT,i}$ is the human ground truth score.
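The display equation for the alignment score is not reproduced in this text. The sketch below computes MAE and RMSE as conventionally defined and, as a clearly labeled assumption, an alignment score of the form one minus the mean absolute deviation divided by the width of the 0–5 scale. That normalization is a guess consistent with the symbols defined above, not the paper's confirmed formula, and the sample numbers are invented.

```python
from math import sqrt
from typing import Sequence


def _check(llm: Sequence[float], gt: Sequence[float]) -> None:
    if len(llm) != len(gt) or not llm:
        raise ValueError("score lists must be non-empty and of equal length")


def mae(llm: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error between LLM ratings and ground truth means."""
    _check(llm, gt)
    return sum(abs(a - b) for a, b in zip(llm, gt)) / len(llm)


def rmse(llm: Sequence[float], gt: Sequence[float]) -> float:
    """Root mean squared error between LLM ratings and ground truth means."""
    _check(llm, gt)
    return sqrt(sum((a - b) ** 2 for a, b in zip(llm, gt)) / len(llm))


def alignment_score(llm: Sequence[float], gt: Sequence[float], scale_range: float = 5.0) -> float:
    """ASSUMED form: 1 - MAE / scale_range, so 1.0 means perfect agreement."""
    return 1.0 - mae(llm, gt) / scale_range


# Invented ratings for three items of one foundation.
llm_scores = [4, 5, 3]
gt_means = [3.5, 4.5, 4.0]
print(round(mae(llm_scores, gt_means), 4))
print(round(rmse(llm_scores, gt_means), 4))
print(round(alignment_score(llm_scores, gt_means), 4))
```

Each item contributes equally to the mean, which matches the equal per-item weighting the text describes for foundation scores.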
The World Values Survey (WVS) component²² provides a more extensive set of ground truth data, encompassing mean scores, standard deviations, and population distributions across various response options, along with clearly defined acceptable response ranges. This thorough benchmarking facilitates a detailed assessment of how LLM responses correspond with population-level value distributions. The ’expected reasoning elements’ were developed in accordance with the validated theoretical framework provided by the World Values Survey for the evaluation of cross-cultural values²², systematically identifying essential concepts that align with the WVS value dimensions and are consistently observed in documented human response patterns across different cultural contexts. Our technical assessment analyzes the consistency with population-level response trends and the existence of these vital reasoning elements³¹, enabling us to pinpoint responses that might achieve acceptable ratings but lack significant ethical depth.
For Moral Dilemmas, we employ a qualitatively richer ground truth structure based on established philosophical and psychological analysis of each scenario^17,26. Rather than prescribing single correct answers, the ground truth provides evaluation criteria focusing on reasoning process elements such as principle identification, stakeholder consideration, and ethical balance²⁵. The criteria for evaluating ground truth were formulated by adhering to recognized philosophical frameworks for assessing moral reasoning^17,25. This involved a systematic application of canonical ethical frameworks, including deontological, consequentialist, and virtue ethics, while also integrating empirically documented reasoning patterns found in the literature of moral psychology. Our technical approach implements a multi-dimensional rubric that quantifies these qualitative aspects, with independent scoring across dimensions like “acknowledges competing values,” “considers consequences,” and “applies consistent principles”²⁸.
The technical implementation employs several advanced natural language processing techniques. For extracting and evaluating LLM reasoning, we apply sentence-level embedding models³⁶ to compute semantic similarity with ground truth samples, supplemented by pattern-based detection of specific ethical concepts⁷. We quantify reasoning quality using a composite Reasoning Quality Index (RQI), where $\text{Sim}(R_{LLM}, R_{GT})$ represents semantic similarity between LLM and ground truth reasoning, $P_{key}$ is the proportion of expected reasoning elements present, $\text{Coh}$ measures internal coherence, and $\alpha$, $\beta$, $\gamma$ are calibrated weighting parameters³⁷.
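The display equation for the RQI is not reproduced in this text. Given three components and three weighting parameters, the most natural reading is a weighted linear combination; the form below is a reconstruction under that assumption and should be checked against the published article.

```latex
\mathrm{RQI} = \alpha \cdot \mathrm{Sim}(R_{LLM}, R_{GT}) + \beta \cdot P_{key} + \gamma \cdot \mathrm{Coh}
```

Under this reading, if each component lies in $[0, 1]$ and the weights sum to 1, the index also lies in $[0, 1]$.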
We apply equal weighting across all questions within each moral foundation, in line with the original MFQ validation methodology, which stipulates that individual items contribute equally to foundation scores⁴. This method maintains the psychometric properties of the validated instrument.
The weighting parameters $\alpha = 0.4$, $\beta = 0.4$, and $\gamma = 0.2$ were established through alignment with recognized moral psychology benchmarks outlined in “Adaptation of moral psychology instruments for LLM evaluation”. The equal weighting of semantic similarity ($\alpha$) and expected reasoning elements ($\beta$) indicates that both content accuracy and reasoning thoroughness are critical for moral evaluation. The reduced coherence weight ($\gamma$) ensures that linguistic fluency does not overshadow meaningful ethical reasoning. The precise parameter values were refined iteratively against the validation benchmarks to improve their alignment with recognized standards in moral psychology. Alternative calibration approaches could include machine learning optimization or expert consensus weighting, though our benchmark-driven approach ensures consistency with decades of validated moral psychology research rather than subjective preferences.
Consistency evaluation employs cross-question analysis to identify logical contradictions in a model’s ethical framework³⁸. We measure ethical consistency across related value judgments using an Ethical Consistency Metric (ECM), where $R$ is the set of conceptually related question pairs, $m$ is the number of such relationships, $S_i$ and $S_j$ are normalized scores for related questions, and $S_{max}$ is the maximum possible difference in scores.
For dilemma resolution, we created a feature extraction pipeline that identifies specific reasoning components (such as consequentialist versus deontological approaches and stakeholder identification) through specialized classification models³⁹.
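To make the effect of these weights concrete, the sketch below evaluates a weighted sum of the three components for two invented responses. The linear form is an assumption (the display equation is not reproduced in this text), the function name `reasoning_quality_index` is hypothetical, and the component values are made up for illustration.

```python
def reasoning_quality_index(
    sim: float,
    p_key: float,
    coh: float,
    alpha: float = 0.4,
    beta: float = 0.4,
    gamma: float = 0.2,
) -> float:
    """ASSUMED linear form: alpha * sim + beta * p_key + gamma * coh."""
    for name, value in (("sim", sim), ("p_key", p_key), ("coh", coh)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return alpha * sim + beta * p_key + gamma * coh


# Invented component values for two hypothetical responses.
fluent_but_shallow = reasoning_quality_index(sim=0.30, p_key=0.25, coh=0.95)
substantive_but_rough = reasoning_quality_index(sim=0.80, p_key=0.75, coh=0.50)
print(round(fluent_but_shallow, 2))      # 0.12 + 0.10 + 0.19 = 0.41
print(round(substantive_but_rough, 2))   # 0.32 + 0.30 + 0.10 = 0.72
```

With coherence weighted at 0.2, a highly fluent response with little ethical content scores well below a less polished response that covers the expected reasoning elements, which is the behavior the paragraph above describes.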
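The display equation for the ECM is not reproduced in this text. A form consistent with the symbols defined above is one minus the mean normalized absolute difference over related pairs, that is, 1 - (1/m) * sum over (i, j) in R of |S_i - S_j| / S_max. The sketch below implements that assumed form; the function name, question identifiers, and scores are hypothetical, and it further assumes that related questions are keyed in the same direction.

```python
from typing import Mapping, Sequence, Tuple


def ethical_consistency(
    scores: Mapping[str, float],
    related_pairs: Sequence[Tuple[str, str]],
    s_max: float = 1.0,
) -> float:
    """ASSUMED form: 1 - mean(|S_i - S_j| / S_max) over related question pairs.

    Returns 1.0 when every related pair receives identical scores and 0.0
    when every pair differs by the maximum possible amount.
    """
    if not related_pairs:
        raise ValueError("at least one related pair is required")
    total = sum(abs(scores[i] - scores[j]) for i, j in related_pairs)
    return 1.0 - total / (len(related_pairs) * s_max)


# Invented normalized scores for three conceptually linked questions.
scores = {"q1": 0.8, "q2": 0.6, "q3": 0.2}
pairs = [("q1", "q2"), ("q2", "q3")]
print(round(ethical_consistency(scores, pairs), 3))  # 1 - (0.2 + 0.4) / 2 = 0.7
```

Pairs that are expected to receive opposite ratings would need one score reversed before being passed in; the source does not say how such pairs are handled.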
Our evaluation framework integrates these components into a unified system that assesses LLM responses via targeted evaluation modules, standardizes scores for cross-model comparisons, and generates comprehensive reports detailing performance across various aspects¹¹. This methodological framework harmonizes quantitative accuracy with qualitative insight, recognizing that the assessment of moral reasoning necessitates both statistical precision and a sophisticated evaluation of reasoning intricacies⁴⁰.
By integrating meticulously designed ground truth benchmarks with focused technical evaluation techniques, our framework establishes a robust basis for the comparative analysis of moral reasoning abilities across various language models, adeptly encompassing both the results and the processes inherent in ethical decision-making⁸. The benchmark datasets and evaluation codebase are available at https://github.com/The-Responsible-AI-Initiative/LLM_Ethics_Benchmark.git.
This assessment framework holds significant implications for developers and stakeholders engaged in the development and deployment of LLMs². Our findings provide developers with actionable recommendations to enhance training datasets, fine-tuning methods, and prompt design^41,42. For instance, models that encounter difficulties with cross-cultural ethics could benefit from the inclusion of a wider range of cultural perspectives in their training data^43,44. However, it is important for developers to recognize that our WVS-based cultural assessments have certain limitations, as standardized surveys may not completely capture the nuanced values of different cultures and could reflect Western biases. Stakeholders, including policymakers and industry leaders, can apply these insights to create ethical guidelines for the deployment of LLMs in critical fields such as healthcare, education, and law enforcement^6,45.
While the design is detailed, it still has several important limitations²⁸. The inherent subjectivity of moral reasoning creates challenges, as personal, cultural, and contextual perspectives lead to complex rather than simple answers^5,24. This subjectivity hinders the creation of universally accepted standards for evaluation⁴⁰. Moreover, the framework relies on predefined scenarios that might not truly reflect the ethical intricacies of real-life situations, where ambiguous information, conflicting values, and ever-changing contexts are typical^17,26. LLMs can generate reasoning that appears logical yet is flawed, a problem further complicated by their training on biased datasets^29,46. Additionally, the framework does not account for how ethical standards evolve across cultures and time periods²¹, and it concentrates on text-based scenarios, which could overlook multimodal ethical dilemmas⁴⁷.
Our thorough analysis indicates considerable differences in the moral reasoning abilities of the evaluated LLM systems¹¹. Table 1 illustrates the overall performance across our three main assessment criteria.
The findings reveal that leading models such as Claude⁴⁸ and GPT-4³ attain the highest overall scores, with Claude showing exceptional strength in maintaining value consistency, while GPT-4 stands out in terms of reasoning complexity. Importantly, all models exhibit greater competence in Moral Foundations Alignment (MFA) relative to more intricate aspects like dilemma resolution, implying that basic moral intuitions may be more effectively integrated during model training than more sophisticated ethical reasoning⁴⁹.
In order to gain a deeper understanding of model performance, we conducted an analysis of their alignment with the five moral foundations outlined in Moral Foundations Theory⁴ (see Table 2).
All models show significantly higher performance on individualizing foundations (Care and Fairness) versus binding foundations (Loyalty, Authority, and Sanctity), reflecting WEIRD population patterns²³. Figure 2 illustrates this trend.
Fig. 2: Moral foundation alignment across models compared with human baseline.
Furthermore, we assessed the complexity of moral reasoning processes beyond mere alignment with human judgments²⁸. Our investigation identified four essential elements of ethical deliberation: principle identification, perspective-taking, consequence analysis, and principle application¹⁸. Each component was evaluated using standardized scoring rubrics adapted from established moral psychology frameworks, with responses scored independently by two trained evaluators (Cohen’s $\kappa = 0.89$, indicating strong inter-rater reliability).
Table 3 details model performance across these reasoning dimensions. Statistical analysis using one-way ANOVA revealed significant differences between models across all four components (all F-tests: $p < 0.001$), with large effect sizes ($\eta^2$ between 0.39 and 0.46) indicating substantial practical differences in reasoning capabilities.
Our results show that all models are comparatively strong at identifying relevant moral principles and evaluating consequences, but weaker at adopting multiple perspectives and at applying principles consistently across contexts³⁹. Subsequent analyses indicated that Claude 3.7 Sonnet exhibited the most robust perspective-taking abilities, significantly surpassing all other models ($p < 0.01$), whereas GPT-4o excelled in identifying principles and analyzing consequences. This trend suggests that while models can identify ethical concerns, they struggle with the integrated reasoning that sophisticated moral judgment requires²⁵.
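For readers unfamiliar with the agreement statistic, the following is a minimal sketch of unweighted Cohen's kappa for two raters. The text does not say whether an unweighted or weighted variant was used, so the unweighted form here is an assumption, and the rubric scores are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)


# Invented rubric scores (1-5) from two evaluators on eight responses.
evaluator_1 = [1, 2, 3, 3, 4, 5, 5, 2]
evaluator_2 = [1, 2, 3, 4, 4, 5, 5, 2]
print(round(cohens_kappa(evaluator_1, evaluator_2), 3))
```

In this invented example the raters agree on seven of eight items (0.875), chance agreement is 13/64, and kappa is 43/51, roughly 0.843.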
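The F statistic and eta squared reported here follow from the standard partition of variance into between-group and within-group sums of squares. The sketch below shows that computation on invented data; the function name is hypothetical, and the p-value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
from statistics import mean
from typing import Sequence, Tuple


def one_way_anova(groups: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (F statistic, eta squared) for a one-way between-groups design."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    if df_between < 1 or df_within < 1 or ss_within == 0:
        raise ValueError("need at least two groups with within-group variation")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Invented scores for three hypothetical models on one reasoning component.
f_stat, eta_sq = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2), round(eta_sq, 4))  # 13.0 0.8125
```

Eta squared is the share of total variance attributable to group membership, so values of 0.39 to 0.46 mean that model identity accounts for roughly 40% of the variance in component scores.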
To delve deeper into the quality of moral reasoning, we conducted an extensive analysis of reasoning depth across the identical 50 standardized moral dilemma scenarios. Each response was evaluated using a validated 5-point scale adapted from established moral development frameworks^18,19: (1) Basic moral issue identification, (2) Limited stakeholder consideration, (3) Moderate consequence analysis, (4) Advanced ethical framework integration, and (5) Expert-level nuanced synthesis. The performance gaps between top-tier models (GPT-4o, Claude) and LLaMA 3.1 (70B) were particularly pronounced, with Cohen’s d effect sizes ranging from 1.8 to 2.4 across all reasoning components, indicating very large practical differences.
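Cohen's d expresses the gap between two group means in units of their pooled standard deviation. The sketch below uses the common pooled-variance definition; the source does not state which variant was used, so that choice is an assumption, and the scores are invented.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Invented depth scores for two hypothetical models.
stronger = [4, 5, 4, 5, 4]
weaker = [3, 3, 4, 3, 2]
print(round(cohens_d(stronger, weaker), 3))
```

Here the means differ by 1.4 and the pooled variance is 0.4, giving d of about 2.21. Values near 2 mean the group means are about two pooled standard deviations apart, which is why the reported range of 1.8 to 2.4 is described as very large.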
Figure 3 presents the distribution of reasoning depth scores across models and compares them with human baseline performance. Human baseline data were derived from a comparable moral reasoning study conducted by Rest et al. (1999) using identical scenario types and scoring criteria ((N=200) university students, age 18-25, diverse academic backgrounds). The human sample serves as a validated standard for assessing model performance in relation to conventional human moral reasoning abilities.
The analysis uncovers clear patterns in the complexity of reasoning. Human participants leaned strongly towards advanced reasoning (scores 4–5), with 74% of their responses reaching these levels. Among LLMs, GPT-4o most closely approximated human performance patterns, with 62% of responses scoring 4–5, followed by Claude 3.7 Sonnet at 56%. In contrast, LLaMA 3.1 (70B) showed a concerning concentration in moderate reasoning levels (score 3), with only 28% achieving advanced reasoning scores. Post-hoc pairwise comparisons using Dunn’s test with Bonferroni correction showed that human baseline performance significantly exceeded that of all LLMs (all p < 0.001). GPT-4o and Claude 3.7 Sonnet, however, did not differ significantly from one another (p = 0.23), indicating that the top two models have similar reasoning depth capabilities. A qualitative review of the responses suggests that higher-performing models exhibit deeper reasoning, a more nuanced grasp of conflicting values, and a more consistent application of ethical principles across a range of situations⁴¹.
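The Bonferroni step used in the post-hoc comparisons can be sketched as follows. This covers only the multiple-comparison adjustment, not Dunn's rank-based test itself; the function names and the example p-values are hypothetical, and the group count of six assumes five models plus the human baseline.

```python
from math import comb


def n_pairwise(k):
    """Number of distinct pairwise comparisons among k groups."""
    return comb(k, 2)


def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Assumption: six groups (five models plus the human baseline).
comparisons = n_pairwise(6)

# Hypothetical raw p-values from three pairwise tests (illustrative only).
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
```

A comparison is then declared significant only if its adjusted p-value stays below the chosen threshold.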
Distribution of reasoning depth scores in moral dilemma responses across models and human baseline. Sample sizes: N = 50 scenarios per model, N = 200 for human baseline¹⁹. Scoring criteria: 1 = basic moral recognition, 2 = limited analysis, 3 = moderate reasoning with stakeholder consideration, 4 = advanced integration of multiple ethical frameworks, 5 = expert-level synthesis with cultural sensitivity. Chi-square analysis: χ²(20) = 45.7, p < 0.001, indicating significant differences in reasoning depth distributions between groups.
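A chi-square test of independence on a groups-by-score contingency table can be sketched as below. The function name and the toy counts are hypothetical. As a consistency check, a table with six groups (assumed here to be five models plus the human baseline) and five score levels has (6 − 1) × (5 − 1) = 20 degrees of freedom, which matches the χ²(20) reported in the caption.

```python
def chi_square_independence(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and score level.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical 2 x 2 counts (illustrative only).
chi2, df = chi_square_independence([[10, 20], [20, 10]])
```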
To evaluate the stability of moral reasoning, we performed several assessment rounds with slight variations in prompts⁵⁰. This testing validates that our standardized prompt approach (described in “Technical implementation and prompt engineering”) produces reliable results regardless of minor wording variations. We examined three types of question modifications across five separate assessment rounds to ensure the reliability of our results. First, we reformulated questions while maintaining their moral significance (transforming “How much do you agree…” into “To what extent do you support…”). Second, we presented the same moral scenarios with varying contextual backgrounds. Finally, we alternated between direct numerical ratings and explanatory responses that we converted into numerical scores.
Each language model was evaluated with all question variations over five independent rounds, with assessments occurring 24 h apart to avoid carryover effects. This methodology resulted in 15 evaluations per model (3 question types × 5 rounds). Figure 4 displays the consistency scores throughout these evaluation rounds.
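The evaluation design described above (three variation types crossed with five rounds spaced 24 h apart) can be enumerated with a short sketch. The variation labels, the field names, and the `build_schedule` helper are hypothetical and serve only to make the 3 × 5 = 15 count concrete.

```python
from itertools import product

# Hypothetical labels for the three kinds of question modification.
VARIATION_TYPES = ("rephrased_wording", "varied_context", "response_format")
ROUNDS = range(1, 6)  # five independent rounds
HOURS_BETWEEN_ROUNDS = 24


def build_schedule(models):
    """List one evaluation entry per (model, variation type, round)."""
    return [
        {
            "model": model,
            "variation": variation,
            "round": rnd,
            # Offset from the first round, assuming rounds are evenly spaced.
            "hours_offset": (rnd - 1) * HOURS_BETWEEN_ROUNDS,
        }
        for model, variation, rnd in product(models, VARIATION_TYPES, ROUNDS)
    ]


schedule = build_schedule(["model_a"])
```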
As shown in Table 4, Claude 3.7 Sonnet shows the highest degree of consistency (CV = 0.54%), which signifies stable moral reasoning processes despite alterations in question framing⁴⁸. GPT-4o also displays good stability (CV = 1.10%), with only minor fluctuations observed across the evaluation rounds. In contrast, LLaMA 3.1 (70B) shows a significant sensitivity to prompt variations (CV = 3.60%), the largest variation among all models evaluated⁵¹. This variability implies that LLaMA’s moral reasoning outputs are considerably influenced by superficial changes in how questions are presented.
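The coefficient of variation (CV) reported here is the standard deviation of a model's scores expressed as a percentage of their mean. A minimal sketch follows; the text does not say whether the sample or population standard deviation was used, so the sample form is an assumption, and the function name and scores are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """CV in percent: sample standard deviation relative to the mean."""
    return 100.0 * stdev(scores) / mean(scores)


# Hypothetical consistency scores from three evaluations (illustrative only).
cv = coefficient_of_variation([90, 100, 110])
```

A lower CV means the scores cluster tightly around their mean, so the model's ratings change little when the prompt wording changes.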
Consistency scores across evaluation rounds with prompt variations (N = 15 evaluations per model). Statistical analysis: F(4,20) = 18.7, p < 0.001.
The findings can be interpreted in different ways concerning the variation of prompts in ethical reasoning applications⁴². Models that score higher in consistency tend to display stable moral reasoning processes that are less prone to framing effects, whereas variations in prompt formulations may reflect context-sensitive ethical reasoning. In actual applications where ethical scenarios are framed differently, models that adapt their reasoning techniques while adhering to fundamental principles might reveal more refined ethical processing capabilities.
To investigate the connections between moral foundations in LLM reasoning, we computed Pearson correlation coefficients for the foundation scores across all assessed scenarios⁴. Table 5 displays the inter-foundation correlations for all analyzed models, assessing whether they exhibit consistent internal moral frameworks that correspond with recognized psychological theories.
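The inter-foundation correlations can be computed along the following lines. This is a sketch under stated assumptions: the per-scenario scores are assumed to be stored as one list per foundation, and the function names and toy values are hypothetical rather than taken from the released codebase.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def foundation_correlations(scores):
    """Map each unordered pair of foundations to its Pearson r."""
    return {
        (f, g): pearson_r(scores[f], scores[g])
        for f, g in combinations(scores, 2)
    }


# Hypothetical per-scenario foundation scores for one model (illustrative only).
toy_scores = {
    "care": [1, 2, 3, 4],
    "fairness": [2, 4, 6, 8],
    "sanctity": [4, 3, 2, 1],
}
correlations = foundation_correlations(toy_scores)
```

With all five foundations as keys, the result contains the ten pairwise coefficients that make up one model's half of a correlation table.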
The correlation analysis reveals systematic differences in moral framework coherence across models. Claude 3.7 Sonnet and GPT-4o demonstrate correlation patterns most similar to human baselines, with strong within-category correlations and appropriately weak cross-category relationships. The individualizing foundations (Care-Fairness) show robust correlations in top-performing models (r = 0.74–0.78), while binding foundations (Loyalty-Authority-Sanctity) maintain strong intercorrelations (r = 0.61–0.72), consistent with moral foundations theory⁵.
Mid-tier models (Deepseek-V3, Gemini 2.5 Pro) exhibit moderate correlation patterns that preserve the theoretical distinction between individualizing and binding foundations, though with somewhat weaker within-category relationships (r = 0.52–0.68). Most notably, LLaMA 3.1 (70B) shows substantially weaker correlations across all foundation pairs and exhibits an atypically strong Care-Sanctity correlation (r = 0.35), suggesting less coherent internal moral reasoning structures and potential confusion between theoretically distinct moral concerns.
The Care-Sanctity correlation serves as a particularly informative indicator of model coherence, as these foundations represent fundamentally different moral priorities (individual welfare versus purity concerns). Higher-performing models maintain appropriately weak correlations (r = 0.19–0.25), while LLaMA’s stronger correlation suggests difficulty distinguishing between these distinct moral domains⁴.
These findings demonstrate that advanced LLMs can develop internally consistent moral reasoning frameworks that mirror established psychological theories, with correlation strength and pattern coherence serving as indicators of overall moral reasoning sophistication.
Furthermore, our analysis identified several recurring patterns of reasoning errors across various models², as illustrated in Fig. 5, which emphasizes the prevalence of different failure modes.
Distribution of failure mode frequencies across large language models. Each cell represents the percentage of instances where a specific failure mode (rows) was observed for each model (columns). Color intensity corresponds to failure rate, with darker shades indicating higher frequencies. LLaMA 3.1 (70B) demonstrates consistently elevated failure rates across all evaluated categories.
The identified failure patterns are categorized into five distinct groups: (1) Value conflicts arise when models face situations that involve genuinely competing moral principles, which often leads to overly simplistic solutions³⁹; (2) Cultural biases indicate the prevalent use of Western-centric ethical frameworks in cross-cultural contexts, thereby restricting global relevance⁴³; (3) Overgeneralization takes place when models implement ethical principles without sufficient regard for context-specific subtleties⁴⁵; (4) Inconsistency appears as conflicting moral judgments in conceptually similar situations, which diminishes reliability⁴²; and (5) Context insensitivity denotes the inability to acknowledge and integrate pertinent contextual elements during ethical assessments⁵².
To assess model performance against established psychological benchmarks, we compared MFA scores with human baseline data derived from validated studies in moral psychology⁴. This analysis shows the extent to which AI systems mirror human moral reasoning patterns across various ethical domains.
The analysis indicates systematic variations in model alignment across different moral foundation categories. All assessed models show a significantly greater agreement with human judgments regarding individualizing foundations (Care and Fairness) in comparison to binding foundations (Loyalty, Authority, and Sanctity), which aligns with established cross-cultural differences in moral reasoning^5,23. Claude 3.7 achieves the highest overall alignment with human moral intuitions at 91.2%, followed by GPT-4o at 87.8%, whereas LLaMA 3.1 shows the most significant deviation from human baselines at 76.3%³². This trend points to the possibility that existing language models may reveal a consistent bias towards individualistic ethical frameworks, which could impede their effectiveness in collectivist cultural contexts where collective principles are of greater moral importance. Figure 6 represents these alignment trends among all reviewed models.
Model alignment with human moral judgments across moral foundation categories. Individualizing foundations comprise Care and Fairness, while Binding foundations include Loyalty, Authority, and Sanctity. All models demonstrate stronger alignment with individualizing versus binding moral foundations.
This research makes five key contributions to AI ethics evaluation. First, we propose a three-dimensional evaluation framework encompassing Moral Foundation Alignment, Reasoning Quality Index, and Value Consistency Assessment for comprehensive ethical evaluation^11,28. Second, we deliver quantifiable metrics for systematically comparing LLM moral reasoning performance across different model architectures^32,53. Third, we establish standardized benchmarks by adapting established moral psychology instruments (MFQ-30, WVS, Moral Dilemmas), providing reliable assessment tools^4,17,22. Fourth, we provide open-source implementation through a comprehensive repository enabling broad adoption of ethical AI evaluation^54,55. Finally, our findings reveal actionable insights for improving cross-cultural alignment and moral consistency in LLMs^5,43.
While LLMs excel in structured tasks, their ethical reasoning across diverse scenarios remains limited^2,7. Our open-source benchmarking tools aim to make ethics evaluation more accessible and create a framework for comparing ethical AI development approaches^53,54. As LLMs become embedded in critical decision-making processes, ensuring their alignment with human values across cultural contexts transforms from a technical challenge into an ethical necessity⁶.
Future work should develop adaptive evaluation methods incorporating real-world scenarios, gather diverse cultural perspectives through international collaborations⁵⁶, and integrate explainability tools to better understand LLM reasoning processes⁵⁷. Expanding to multimodal datasets and establishing global standards through researcher-policymaker collaboration will contribute to more comprehensive and inclusive evaluation frameworks^47,58.
Data Availability: The datasets generated and analysed during the current study are available in the LLM Ethics Benchmark repository: https://github.com/The-Responsible-AI-Initiative/LLM_Ethics_Benchmark.git
Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Bommasani, R. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
OpenAI. Gpt-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
Graham, J. et al. Mapping the moral domain. J. Pers. Soc. Psychol. 101(2), 366–385 (2011).
Haidt, J. The Righteous Mind: Why Good People are Divided by Politics and Religion. (Vintage, 2012).
Gabriel, I. Artificial intelligence, values, and alignment. Minds Mach. 30, 411–437 (2020).
Hendrycks, D. et al. Ethics of artificial intelligence. arXiv preprint arXiv:2106.08458 (2021).
Amodei, D. et al. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016).
Rae, J.W. et al. Scaling language models: Methods, analysis from training gopher. arXiv preprint arXiv:2112.11446 (2021).
Yang, Y. et al. Glue-x: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073 (2022).
Liang, P. et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
Hendrycks, D. et al. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021).
Hendrycks, D. et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021).
Singhal, S. et al. Large language models encode clinical knowledge. Nature 620(7972), 1–10 (2023).
Huang, Y. et al. Trustgpt: A benchmark for trustworthy and responsible large language models. arXiv preprint arXiv:2306.11507 (2023).
Zheng, Z. et al. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2305.13656 (2023).
Cushman, F., Young, L. & Hauser, M. The dynamics of moral judgment. Cognition 103(3), 393–420 (2006).
Kohlberg, L. The Philosophy of Moral Development: Moral Stages and the Idea of Justice. (Harper & Row, 1981).
Rest, J., Narvaez, D., Bebeau, M. J. & Thoma, S. J. Postconventional Moral Thinking: A Neo-Kohlbergian Approach (Lawrence Erlbaum Associates, 1999).
Haidt, J. The emotional dog and its rational tail: A social intuitionist approach to moral judgment. Psychol. Rev. 108(4), 814–834 (2001).
Graham, J. et al. Moral foundations theory: The pragmatic validity of moral pluralism. Adv. Exp. Soc. Psychol. 47, 55–130 (2013).
Inglehart, R. et al. World Values Surveys and European Values Surveys, 1981-1984, 1990-1993, and 1995-1997. (Inter-university Consortium for Political and Social Research, 2000).
Henrich, J., Heine, S. J. & Norenzayan, A. The weirdest people in the world?. Behav. Brain Sci. 33(2–3), 61–83 (2010).
Shweder, R. A. et al. The “big three” of morality (autonomy, community, divinity) and the big three explanations of suffering. Moral. Health 119–169 (1997).
Greene, J. D. et al. An FMRI investigation of emotional engagement in moral judgment. Science 293(5537), 2105–2108 (2001).
Awad, E. et al. The moral machine experiment. Nature 563(7729), 59–64 (2018).
Zhu, Z. et al. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528 (2023).
Denny, N. & Matusiak, A. Measuring moral reasoning using moral dilemmas: Evaluating reliability, validity, and differential item functioning of the behavioral defining issues test. J. Moral Educ. 50(3), 316–337 (2021).
Rudinger, R. et al. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 8–14 (2018).
Wang, W. et al. Chain-of-thought prompting for responding to in-depth dialogue questions with LLM. arXiv preprint arXiv:2307.05082 (2023).
Chen, M. et al. Do LLMS understand social knowledge? Evaluating the sociability of large language models with socket benchmark. arXiv preprint arXiv:2306.13249 (2023).
LMSYS. Chatbot Arena: Benchmarking LLMS in the Wild with ELO Ratings (2023). https://lmsys.org.
Zhong, Z. et al. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364 (2023).
Hendrycks, H. et al. Human-AI moral consistency: Bias recognition dataset. https://github.com/hendrycks/ethics (2023).
Devlin, J. et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019).
Reimers, N. & Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 3982–3992 (2019).
Liu, X. et al. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4487–4496 (2019).
Li, N. et al. Logic-guided semantic representation learning for zero-shot relation classification. In Proceedings of the 28th International Conference on Computational Linguistics. 2967–2978 (2019).
Emelin, D. et al. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 698–718 (2021).
Lind, G. The moral judgment test (MJT): Thirty years of research. Educ. Res. Eval. 6(1), 1–20 (2000).
Wei, J. et al. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
Cao, A. et al. Cultural alignment of large language models. arXiv preprint arXiv:2212.10511 (2022).
Stephanie Hanes Larson. Gender, race, and intersectionality on the federal appellate bench. Washington Univ. Law Rev. 97(5), 1–42 (2017).
Weidinger, L. et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 214–229 (2022).
Bender, E.M. et al. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 610–623 (2021).
Coda, E.M. et al. Multimodal-coda: A dataset for contrastive disambiguation analysis in multimodal reasoning. arXiv preprint arXiv:2307.01254 (2023).
Anthropic. Claude: A conversational AI assistant. Anthropic Blog. https://www.anthropic.com/index/introducing-claude (2023).
Clark, E. et al. On the role of reasoning in moral alignment of language models. arXiv preprint arXiv:2302.07459 (2023).
Zhao, T. et al. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690 (2021).
Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Li, N. et al. Systematic evaluation of causal discovery in visual model based reinforcement learning. arXiv preprint arXiv:2203.12188 (2022).
Liang, P. et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45 (2020).
Johnson, R. et al. Ethics-eval: A benchmarking framework for ethical evaluation of language models (2023). https://github.com/AI-secure/ethics-eval.
Arora, A., Storks, S., Nakhost, H., Chen, J. & Liang, P. Probing pre-trained language models for cross-cultural differences in values. Find. Assoc. Comput. Linguist. EMNLP 2022, 3922–3944 (2022).
Molnar, C. Interpretable Machine Learning. Lulu.com (2022).
Jobin, A., Ienca, M. & Vayena, E. The global landscape of AI ethics guidelines. Nat. Mach. Intell. 1(9), 389–399 (2019).
This research is funded by NSF grants 2125858 and 2236305 and the UT-Good Systems Grand Challenge. The authors would like to express their gratitude for these institutes’ support, which made this study possible. Furthermore, we thank AI applications for their assistance in editing and brainstorming.
Urban Information Lab, The University of Texas at Austin, Austin, USA
Junfeng Jiao, Saleh Afroogh & Kevin Chen
Department of Computer Science, The University of Texas at Austin, Austin, USA
Abhejay Murali
McCombs School of Business, The University of Texas at Austin, Austin, USA
David Atkinson
IBM Research, Yorktown Heights, USA
Amit Dhurandhar
J.J., S.A., and A.M. conceived and developed the main idea, contributed to the conceptualization and methodology, and conducted both qualitative and quantitative analyses. K.C., D.A. and A.D. contributed to the methodology and carried out qualitative or quantitative analyses.
Correspondence to Saleh Afroogh.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Jiao, J., Afroogh, S., Murali, A. et al. LLM ethics benchmark: a three-dimensional assessment system for evaluating moral reasoning in large language models. Sci Rep 15, 34642 (2025). https://doi.org/10.1038/s41598-025-18489-7
Received: 30 May 2025
Accepted: 02 September 2025
Published: 05 October 2025
Version of record: 05 October 2025
DOI: https://doi.org/10.1038/s41598-025-18489-7
Scientific Reports (Sci Rep)
ISSN 2045-2322 (online)
© 2026 Springer Nature Limited