Comparing large language models and human annotators in latent content analysis of sentiment, political leaning, emotional intensity and sarcasm

Scientific Reports volume 15, Article number: 11477 (2025)
This article has been updated
In the era of rapid digital communication, vast amounts of textual data are generated daily, demanding efficient methods for latent content analysis to extract meaningful insights. Large Language Models (LLMs) offer potential for automating this process, yet comprehensive assessments comparing their performance to human annotators across multiple dimensions are lacking. This study evaluates the inter-rater reliability, consistency, and quality of seven state-of-the-art LLMs. These include variants of OpenAI’s GPT-4, Gemini, Llama-3.1-70B, and Mixtral 8 × 7B. Their performance is compared to human annotators in analyzing sentiment, political leaning, emotional intensity, and sarcasm detection. The study involved 33 human annotators and eight LLM variants assessing 100 curated textual items. This resulted in 3,300 human and 19,200 LLM annotations. LLM performance was also evaluated across three time points to measure temporal consistency. The results reveal that both humans and most LLMs exhibit high inter-rater reliability in sentiment analysis and political leaning assessments, with LLMs demonstrating higher reliability than humans. In emotional intensity, LLMs displayed higher reliability compared to humans, though humans rated emotional intensity significantly higher. Both groups struggled with sarcasm detection, evidenced by low reliability. Most LLMs showed excellent temporal consistency across all dimensions, indicating stable performance over time. This research concludes that LLMs, especially GPT-4, can effectively replicate human analysis in sentiment and political leaning, although human expertise remains essential for emotional intensity interpretation. The findings demonstrate the potential of LLMs for consistent and high-quality performance in certain areas of latent content analysis.
In an era characterized by rapid digitization and the proliferation of online communication platforms, vast amounts of textual data are generated daily. While the proliferation of online communication provides rich insights, extracting meaning from vast textual data remains resource intensive. Latent content analysis, which involves decoding the underlying meanings, sentiments, and nuances in text, is crucial for understanding social dynamics, informing policy decisions, and guiding business strategies¹. Automating this process could significantly enhance our ability to respond to societal needs promptly and effectively.
The societal implications of effectively analyzing textual content are profound. Sentiment analysis can reveal public opinion on policies or products, influencing governmental decisions and corporate strategies². Understanding political leanings aids in assessing electoral landscapes and fostering democratic engagement³. Detecting emotional intensity and sarcasm in communication is vital for mental health monitoring, customer service, and even national security^4,5. Large Language Models (LLMs) offer the potential to perform these analyses at scale, reducing reliance on extensive human labor and accelerating the time to insight⁶.
The field of automated content analysis has evolved significantly over the past few decades. Early computational approaches relied on manual coding schemes applied to small datasets⁷. The advent of machine learning introduced algorithms capable of handling larger datasets with increased efficiency⁸. Traditional models, such as Naïve Bayes and Support Vector Machines, were used for tasks like sentiment classification but often struggled with contextual understanding⁹.
The introduction of deep learning architectures marked a transformative period in natural language processing (NLP). Models utilizing word embeddings captured semantic relationships between words¹⁰. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks improved the modeling of sequential data¹¹. These advancements enhanced performance in sentiment analysis and emotion detection tasks¹².
Recent research has also explored integrating multi-modal data (e.g., text with images or audio) to enhance sentiment and affect detection, a challenge that LLMs may benefit from addressing¹³. Thareja¹⁴ addressed the challenges posed by extreme emotional sentiments on social media platforms like Twitter, which can impact users’ mental well-being. Introducing Tweet-SentiNet, a multi-modal framework utilizing both image and text embeddings, the study demonstrated improved sentiment analysis by effectively filtering content with extreme sentiments. Similarly, Li et al.¹⁵ proposed a multi-modal sentiment analysis model based on image and text fusion using a cross-attention mechanism. By extracting features using advanced techniques like ALBert for text and DenseNet121 for images, and then fusing them with cross-attention, their model outperformed baseline models on public datasets, achieving accuracy and F1 scores of over 85%. Akhtar et al.¹⁶ explored a deep multi-task contextual attention framework for multi-modal affect analysis. Recognizing that emotions and sentiments are interdependent, they leveraged the associations among neighboring utterances and their multi-modal information.
Recent approaches broaden the scope of stance detection by integrating knowledge graphs, which help capture political or ideological context in more structured ways¹⁷. Likewise, boundary-aware frameworks for few‐shot tasks have shown promise in enriching entity‐level interpretations¹⁸.
Despite these improvements, models still face challenges in interpreting complex linguistic features such as sarcasm and nuanced emotions¹³. Sarcasm detection, for instance, requires an understanding of contextual cues and sometimes external knowledge beyond the text itself^19,6, leading researchers to explore context-aware and multi-modal approaches to enhance detection accuracy. Baruah et al.²⁰ investigated the impact of conversational context on sarcasm detection using deep learning NLP models (BERT, BiLSTM) and a machine learning classifier (SVM). They found that incorporating the last utterance in a dialogue significantly improved classifier performance on Twitter datasets, achieving an F-score of 0.743 with BERT. Exploring the distinction between intended and perceived sarcasm, Oprea and Magdy²¹ introduced the iSarcasm dataset, which consists of tweets labeled for sarcasm directly by their authors, emphasizing the need for datasets that reflect the intended use of sarcasm to improve detection systems.
The introduction of the Transformer architecture²² and pre-trained language models such as BERT²³ and RoBERTa²⁴ significantly advanced NLP capabilities. These models utilized attention mechanisms to capture long-range dependencies in text, leading to state-of-the-art results in various tasks.
Beyond Transformer architectures, emerging methods provide complementary solutions. For example, knowledge-graph‐based architectures can improve stance detection tasks¹⁷, and boundary‐aware LLM designs are increasingly valuable for few‐shot named entity recognition¹⁸. Integrating both constituency and dependency parse information has proven beneficial for relation extraction²⁵.
Large Language Models (LLMs) like GPT-2²⁶ and GPT-3²⁷ expanded these capabilities by increasing model size and training data. GPT-3, with 175 billion parameters, demonstrated remarkable proficiency in zero-shot and few-shot learning scenarios, performing well on tasks it was not explicitly trained for²⁷.
Recent studies have explored LLMs in sentiment analysis and related tasks. Chang & Bergen²⁸ investigated the use of GPT-3 for sentiment classification and found that it performed competitively with fine-tuned models on specific datasets. Similarly, Floridi and Chiriatti²⁹ discussed the potential of GPT-3 in understanding and generating human-like text, highlighting its applicability in content analysis.
The incorporation of context-aware mechanisms²⁰, consideration of intended versus perceived meanings²¹, and the use of multi-modal data^14,15,16 represent critical steps toward improving model performance in complex NLP tasks. The development of domain-specific models like PoliBERTweet³⁰ highlights the potential benefits of customizing language models to better capture specific content areas, such as political discourse. The integration of symbolic reasoning with deep learning in SenticNet 6 further highlights the importance of combining different AI approaches to enhance understanding and interpretation of subtle linguistic features³¹.
However, challenges remain regarding the ethical and practical implications of relying on LLMs. Concerns include model bias, the interpretability of results, and the tendency of LLMs to produce plausible but incorrect or biased outputs^32,33. Additionally, studies have shown that while LLMs excel in language tasks, their performance in detecting sarcasm and nuanced emotions is inconsistent³⁴.
The consistency of LLMs over time is another area of interest. Even when a model has not been updated by its provider, prompting it on separate occasions may produce different outputs for the same input, raising questions about consistency in longitudinal studies³⁵. LLMs can also be sensitive to input phrasing, leading to different interpretations based on slight changes in wording³⁶.
Human annotators have long been the gold standard in content analysis due to their ability to understand context, cultural references, and subtle language cues⁷. Inter-rater reliability metrics such as Krippendorff’s alpha are used to assess consistency among human coders³⁷. Comparing LLM performance against human benchmarks is essential to evaluate their viability as substitutes or supplements in content analysis tasks.
While Large Language Models (LLMs) have demonstrated impressive capabilities, there is a notable lack of comprehensive evaluations comparing their performance to human annotators across multiple dimensions of latent content analysis. Existing studies often focus on single tasks or lack extensive statistical analysis of reliability and quality³⁴. Additionally, the consistency of LLMs over time and their inter-rater reliability in capturing complex linguistic features remain underexplored. To address these gaps this study formulates the following research questions:
RQ1: How reliably do LLMs compare to human annotators across multiple dimensions of latent content analysis? Although the field of content analysis has advanced from manual coding schemes⁷, through machine learning algorithms capable of handling larger datasets with increased efficiency⁸, to deep learning architectures¹⁰, research is still needed, especially on interpreting complex linguistic features such as sarcasm and nuanced emotions¹³. Moreover, other studies show that machines underperform humans in certain tasks³⁸. One major gap in these studies is the lack of direct comparison with human annotations: how LLMs and humans annotate complex linguistic features, and whether humans are better at these tasks.
Previous studies have focused on specific aspects of content analysis, such as sentiment classification²⁸ and sarcasm detection⁶, but there is limited research on how the level of reliability between LLMs and humans varies across different dimensions. This is another gap to be addressed in this study.
RQ2: Are LLMs consistent over time when analyzing textual content? Consistency over time is another underaddressed issue. The same model can produce different outputs when prompted on different occasions³⁵. LLM outputs can also vary due to minor prompt modifications, raising concerns about temporal consistency³⁶.
RQ3: To what extent do LLMs provide analysis that is comparable to human analysis in terms of quality? Human annotators have long been the gold standard in content analysis due to their ability to understand context, cultural references, and subtle language cues⁷. However, the ability of LLMs to learn from context is being examined²⁷, and their ability to produce human-like text²⁹ is improving, with the aim of replacing human annotators with LLMs. Although such studies are growing in number, quality-check studies comparing humans and LLMs are still lacking, and this is another gap to be addressed in this study.
RQ4: How do inter-rater reliability and comparability vary across different LLMs? Previous studies examined the annotation performance of some LLMs, but usually on one or two tasks³⁴ and with one or two models³⁸. In this study we aim to close this gap by including multiple models and multiple tasks and examining their inter-rater reliability and consistency. Given the rapid development and diversity of LLM architectures, each trained on varying datasets and employing different model sizes, it is crucial to understand whether these differences translate into variations in content analysis outcomes.
To evaluate the inter-rater reliability and quality of large language models (LLMs) in latent content analysis, we conducted a comparative study involving both humans and eight types of LLMs that each responded to presented queries to evaluate content by assigning values to statements. Our objective was to benchmark the performance of LLMs and humans across four key dimensions: sentiment, political leaning, emotional intensity, and sarcasm detection by performing (a) within (internal consistency) and (b) between analyses (comparison of performance).
The ethical approval was acquired from the Ethics Committee prior to this research. All methods were carried out in accordance with relevant guidelines and regulations.
The study involved 33 human annotators who were proficient in English. The group brought substantial academic and professional expertise to the study. The sample included 81.8% of annotators holding PhDs and various academic titles ranging from PostDocs to Full and Associate Professors. The annotators included experts from disciplines such as Social Psychology, Communication Science, Linguistics, and Computing and Information Technology. Their affiliations spanned 18 European countries, including Poland, Albania, Czech Republic, Serbia, Portugal, Turkey, Bosnia and Herzegovina, Spain, Austria, Norway, Cyprus, Belgium, Germany, Netherlands, Romania, United Kingdom, France and Ireland. Annotators were integral members of the COST Action Network CA21129, which focuses on integrating theoretical and methodological approaches to analyzing opinionated communication³⁹. This diverse expertise facilitated a robust analysis of opinions, enriching the study with interdisciplinary insights.
In addition to human participants, seven state-of-the-art LLMs were selected for evaluation: GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini, Gemini, Llama-3.1-70B and Mixtral 8 × 7B. Additionally, GPT-4o was prompted in a second way, through an interplay of simulated agents, which we call the Hard Prompt. Thus, we had eight LLM variants. These models were chosen to represent a range of architectures and training data, providing a comprehensive overview of current LLM capabilities.
As can be seen in Table 1, some models, notably GPT-4 and Gemini 1.5 Pro, do not have publicly disclosed information regarding certain architectural parameters such as the number of attention heads and layers. This lack of transparency highlights the proprietary nature of these models and underlines the challenges faced by the research community in performing direct comparisons.
Each LLM was accessed via its respective application programming interface (API) or interface under appropriate usage agreements. To assess consistency over repeated attempts, each LLM was prompted to evaluate the same set of textual items three times, yielding a total of 19,200 LLM-generated annotations.
The textual items used for annotation⁴⁰ were curated and, where appropriate, adapted from existing literature and established datasets commonly used in the study of sentiment analysis, political leaning, emotional intensity, and sarcasm detection. This approach ensured that the sentences were representative of the types of content typically encountered in real-world situations and aligned our study with prior research methodologies.
To enhance the validity and comparability of our study, we referred to several well-established datasets and research studies in the field of natural language processing (NLP) and sentiment analysis. These included the Stanford Sentiment Treebank¹² and the Sentiment140 dataset⁴¹ for sentiment analysis; studies by Iyyer et al.⁴² for political ideology in text; the EmoBank corpus⁴³ for emotional intensity; and the Sarcasm Corpus V2⁴⁴ for sarcasm detection. By using these sources, we aimed to align our sentences with the standards in the field and to ensure that our results are comparable to previous research.
For each dimension, we developed 25 unique sentences, resulting in a total of 100 sentences for analysis.
In the sentiment analysis dimension, sentences were crafted to represent a full range of sentiments from strongly negative to strongly positive. For example, a sentence representing strong negative sentiment was adapted from examples in Pang and Lee⁴⁵: “I hate everything about this product. It’s a complete waste of money and time.” A neutral or mixed sentiment sentence was: “The report had its ups and downs. Some sections were really informative, while others were lacking depth.” An example of strong positive sentiment, adapted from Socher et al.¹², was: “I absolutely love this place! The service is fantastic and the food is incredible.” By including sentences with varying emotional tones, we aimed to test the annotators’ and LLMs’ abilities to accurately perceive and rate the sentiment expressed.
For the political leaning dimension, sentences were designed to reflect a spectrum of political opinions from strongly left-leaning to strongly right-leaning. Drawing from themes in Iyyer et al.⁴², a strongly left-leaning sentence was: “Universal healthcare is essential for a just and equitable society.” A neutral or centrist sentence was: “Economic policies should balance the needs of business growth and social welfare.” A strongly right-leaning sentence was: “Lowering taxes is the best way to stimulate economic growth and individual freedom.” These sentences incorporated references to common political themes and policies, allowing us to evaluate how well annotators and LLMs could detect and interpret ideological cues.
In the emotional intensity dimension, sentences were constructed to exhibit varying levels of emotional expression, guided by the EmoBank corpus⁴³. A sentence with very low emotional intensity, similar to neutral sentences in EmoBank, was: “The data from the recent survey shows a slight increase in customer satisfaction.” A sentence with moderate emotional intensity was: “The announcement was met with mixed feelings; some viewers expressed joy while others felt disappointment.” For very high emotional intensity, we used: “In an outburst of euphoria, he shouted and danced around, his joy uncontainable.” By varying the language from factual and straightforward to vividly expressive, we challenged the annotators and LLMs to discern subtle differences in emotional intensity.
For sarcasm detection, sentences ranged from literal statements to overtly sarcastic remarks. We examined the Sarcasm Corpus V2⁴⁴ and examples from studies like Riloff et al.⁴⁶ to incorporate sentences exhibiting different levels of sarcasm. A non-sarcastic sentence was: “I’m delighted with the new features in the app; it’s exactly what we needed.” A moderately sarcastic sentence, adapted from Riloff et al.⁴⁶, was: “Oh great, another meeting. Just what I needed.” An overly sarcastic sentence, common in sarcasm datasets, was: “You’ve really outdone yourself this time.” These sentences were designed to test the ability of annotators and LLMs to detect sarcasm, which often relies on contextual cues and can be challenging to interpret in written form.
While crafting and selecting these sentences, we adhered to several considerations. First, we wanted to make sure our materials were consistent with prior research, facilitating comparability and enhancing the validity of our findings. Second, we ensured variability across the scales by choosing sentences to represent the full spectrum of each dimension’s Likert scale, allowing for a thorough evaluation of annotators’ and LLMs’ abilities to distinguish between different levels. Third, we included a diversity of content, covering various topics and contexts such as products, services, policies, and everyday situations, to mimic the diversity found in real-world text data. Fourth, sentences were designed to be self-explanatory, providing sufficient context for accurate annotation without requiring external information. Fifth, we avoided including sentences that could be biased, culturally insensitive, or offensive, ensuring ethical considerations were met. Finally, although some sentences were adapted from existing sources, they were paraphrased or modified where necessary to fit the specific grading criteria and to avoid direct replication of copyrighted material.
Annotators were provided with detailed instructions and training materials to ensure a consistent understanding of the annotation tasks. All annotations employed a 5-point Likert scale tailored to each dimension, including Sentiment: 1 (Strongly Negative Sentiment) to 5 (Strongly Positive Sentiment); Political Leaning: 1 (Strongly Left-Leaning) to 5 (Strongly Right-Leaning), Emotional Intensity: 1 (Very Low Emotional Intensity) to 5 (Very High Emotional Intensity) and Level of Sarcasm: 1 (Not Sarcastic) to 5 (Overly Sarcastic).
Human grading was administered during the COST Opinion meeting in Salamanca, Spain on 12/06/2024, utilizing Google Forms⁴⁰. Human annotators individually assessed all 100 textual items to ensure evaluations were independent and uninfluenced by others. They were instructed to rely solely on the text provided, without consulting external resources. The instructions emphasized the importance of consistency and attention to distinctions in the text.
All human participants provided informed consent before participating in the study. They were briefed on the study’s purpose, procedures, and their right to withdraw at any time without penalty. To protect participants’ privacy, all responses were anonymized, and data were stored securely. The study design was reviewed and approved by the institutional review board to ensure adherence to ethical standards.
Efforts were made to minimize potential biases in the study. The selection of textual items aimed at diversity in content to avoid cultural or contextual biases that could affect annotations. Instructions for both human annotators and LLMs emphasized neutrality and objectivity in evaluations.
For the LLMs, we crafted standardized prompts that mirrored the instructions given to human annotators. Prompt design was critical to ensure comparability between human and LLM annotations. Each prompt included clear instructions and specified the annotation scale.
The eight LLM variants comprised the seven LLMs plus GPT-4o run with a slightly modified prompt, the Hard Prompt. This prompt gave the language model additional instructions that initiated a role play of three agents. The LLM was asked to describe the two agents most suitable to do the grading and then a third agent to reach the final decision. The additional instruction, used only in the Hard Prompt, was formulated as follows:
Evaluate the qualifications and attributes of Agent 1 and Agent 2, detailing why each agent is best suited to grade the provided text. Additionally, describe the role and qualifications of Agent 3, who will serve as the judge to make the final grading decision. After assessing the text, provide the grades given by Agent 1 and Agent 2, followed by the final decision rendered by Agent 3.
In the Hard Prompt setting, GPT-4o was configured to simulate a deliberative process by role-playing three virtual agents. First, the model generated responses from two agents, each providing their own preliminary assessments. Then, a third agent synthesized these evaluations to produce a final grading of the provided content. This approach aimed to mimic multi-perspective human deliberation and potentially enhance the quality of annotations.
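The prompt construction described above can be sketched as follows. This is a minimal illustration, not the prompts actually used: the wording of the regular prompt and the names `SCALES`, `build_prompt`, and `parse_grade` are assumptions introduced here. Only the scale anchors and the Hard Prompt instruction are taken from the text.

```python
# Scale anchors as given in the annotation instructions (1 = first label, 5 = second).
SCALES = {
    "sentiment": ("Strongly Negative Sentiment", "Strongly Positive Sentiment"),
    "political leaning": ("Strongly Left-Leaning", "Strongly Right-Leaning"),
    "emotional intensity": ("Very Low Emotional Intensity", "Very High Emotional Intensity"),
    "level of sarcasm": ("Not Sarcastic", "Overly Sarcastic"),
}

# Additional instruction quoted from the Hard Prompt description.
HARD_PROMPT_SUFFIX = (
    "Evaluate the qualifications and attributes of Agent 1 and Agent 2, "
    "detailing why each agent is best suited to grade the provided text. "
    "Additionally, describe the role and qualifications of Agent 3, who will "
    "serve as the judge to make the final grading decision. After assessing "
    "the text, provide the grades given by Agent 1 and Agent 2, followed by "
    "the final decision rendered by Agent 3."
)


def build_prompt(dimension, text, hard=False):
    """Build a grading prompt; the base wording is hypothetical."""
    low, high = SCALES[dimension]
    prompt = (
        f"Rate the {dimension} of the following text on a scale from "
        f"1 ({low}) to 5 ({high}). Rely only on the text provided.\n"
        f"Text: {text}"
    )
    if hard:
        # Hard Prompt: append the three-agent role-play instruction.
        prompt += "\n" + HARD_PROMPT_SUFFIX
    else:
        prompt += "\nAnswer with a single number."
    return prompt


def parse_grade(reply):
    """Accept only a bare grade from 1 to 5 (regular prompt); otherwise None."""
    cleaned = reply.strip()
    return int(cleaned) if cleaned in {"1", "2", "3", "4", "5"} else None
```

The strict parser is deliberate: a Hard Prompt reply mentions "Agent 1" and "Agent 2", so naive digit extraction would misread it, and such replies would need separate handling.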
Regular Prompts for GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini, Gemini, Llama-3.1, Mixtral 8 × 7B, and Hard Prompt for GPT-4o were administered on 12/08/2024, 13/08/2024 and 14/08/2024 in the same time-frame of the day from approximately 18:31 until 21:08⁴⁰.
Data analysis was performed using statistical software packages SPSS (Version 26) and Python 3. These tools facilitated the computation of descriptive statistics, inter-rater reliability coefficients, t-tests, ANOVAs, and effect sizes. The LLMs were accessed through their respective APIs or interfaces, ensuring consistency in how prompts were delivered and responses were recorded.
In our study, we conducted data analysis using Python and SPSS, with each platform designated for specific tasks to facilitate reproducibility. For instance, Krippendorff’s alpha for each latent content dimension (sentiment, political leaning, emotional intensity, and sarcasm detection) was calculated using the Python libraries “nltk” and “random”. These packages allowed us to perform 1000 simulations on random annotator subsets and compute robust inter- and intra-rater reliability metrics.
We used SPSS for t-tests, Levene’s test, one-way ANOVAs, and post hoc tests (with Bonferroni correction) to assess group differences among the human annotators and diverse LLM outputs. SPSS was also used to calculate intra-class correlation coefficients (ICCs) for each latent content dimension for each model.
In parallel, Python was used for data preprocessing, visualization, and additional statistical testing. Python packages, including “pandas” for data manipulation, and “seaborn” and “matplotlib” for generating boxplots and other figures, were integral to our workflow. Detailed scripts in Python handled data cleaning, assumption checks (e.g., for homogeneity of variances), and produced visual representations of the data distributions and ICC trends over time.
RQ1. To address RQ1, we first performed a within-group analysis by assessing inter-rater reliability among human annotators and among LLMs. To do so, we calculated Krippendorff’s alpha for each dimension. Krippendorff’s alpha is suitable for evaluating reliability among multiple raters using ordinal data and accounts for the possibility of agreement occurring by chance³⁷. High values of Krippendorff’s alpha indicate strong reliability among annotators⁴⁷. For each dimension, we conducted 1000 simulations using random subsets of one-third of our total pool of 33 annotators, and we visualized the results in a boxplot to illustrate the distribution and variability of Krippendorff’s alpha within these subsets. The same methodology was applied to the LLMs. Additionally, the mean value from these simulations is presented on the boxplot for each dimension, estimating the inter-rater reliability that might be expected across all annotators or LLMs, thereby facilitating comparative analysis.
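The subset simulation described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the study's script: it uses the interval (squared-difference) distance and assumes complete data with no missing ratings, whereas the analysis itself relied on “nltk”. The names `krippendorff_alpha_interval` and `subset_alphas` are hypothetical.

```python
import random


def _pairwise_sq_diff_sum(values):
    """Sum of (a - b)^2 over all unordered pairs: n * sum(v^2) - (sum v)^2."""
    n = len(values)
    return n * sum(v * v for v in values) - sum(values) ** 2


def krippendorff_alpha_interval(units):
    """Alpha for complete data; `units` holds one list of ratings per item.

    Assumes an interval distance metric and at least two ratings per item.
    """
    all_values = [v for unit in units for v in unit]
    n = len(all_values)
    # Observed disagreement: within-item pairs, weighted by 1 / (m_u - 1).
    observed = sum(
        2 * _pairwise_sq_diff_sum(u) / (len(u) - 1) for u in units
    ) / n
    # Expected disagreement: all pairs of values regardless of item.
    expected = 2 * _pairwise_sq_diff_sum(all_values) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


def subset_alphas(ratings_by_rater, subset_size, n_sims=1000, seed=0):
    """Alpha on `n_sims` random rater subsets; ratings are aligned by item."""
    rng = random.Random(seed)
    n_items = len(ratings_by_rater[0])
    alphas = []
    for _ in range(n_sims):
        chosen = rng.sample(ratings_by_rater, subset_size)
        units = [[rater[i] for rater in chosen] for i in range(n_items)]
        alphas.append(krippendorff_alpha_interval(units))
    return alphas
```

With 33 annotators, a one-third subset corresponds to `subset_size=11`. The resulting list can be passed to a boxplot routine, and its mean gives the summary value reported per dimension.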
RQ2. To address RQ2, which investigated whether Large Language Models (LLMs) are consistent over time when analyzing textual content, we conducted a repeated measures analysis focusing on the intra-model consistency of each LLM across the three time points. Each LLM evaluated the same set of 100 textual items on three separate occasions, allowing us to assess the stability of their ratings over time. For each LLM and dimension, we calculated the Intra-Class Correlation Coefficient (ICC) to quantify the degree of consistency in the ratings across the three time points. The ICC measures the proportion of variance in the ratings due to the items being rated, relative to the total variance including measurement error⁴⁸. ICC values range from 0 to 1, with higher values indicating greater consistency (where values below 0.5 indicate poor reliability, between 0.5 and 0.75 moderate reliability, between 0.75 and 0.9 good reliability, and any value above 0.9 indicates excellent reliability).
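As a rough illustration of this consistency measure, the sketch below computes a single-measure, two-way consistency ICC, often written ICC(3,1), from an items × time-points table, and maps it to the reliability bands listed above. The choice of ICC form is an assumption, since the SPSS model settings are not stated here; `icc_consistency` and `icc_band` are hypothetical names.

```python
def icc_consistency(table):
    """ICC(3,1): rows are items, columns are repeated ratings (time points)."""
    n = len(table)       # number of items
    k = len(table[0])    # number of time points
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(table[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def icc_band(icc):
    """Map an ICC value to the interpretation bands given in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"
```

Under this consistency form, a constant shift between time points does not lower the ICC. An absolute-agreement form would penalize such shifts.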
RQ3. For RQ3, we examined the extent to which LLM outputs align with human annotators’ judgments across four dimensions. First, we conducted independent-sample t-tests comparing the mean annotations of human coders to the combined mean of all LLMs for each dimension. Where Levene’s test indicated unequal variances, Welch’s t-test was used. We then performed one-way ANOVAs to evaluate differences among the nine groups (eight LLMs and the human annotators), followed by Bonferroni post-hoc tests if significant main effects were detected. Finally, we compared the mean dimension ratings between humans and each individual LLM to determine which models most closely replicated human judgments.
We followed procedures outlined in statistical textbooks⁴⁹, which suggest that t-tests provide a robust method for comparing the means of two groups. They are more informative than a simple comparison of means because they also indicate whether an observed difference is statistically significant or merely a product of chance. We also considered effect sizes: if an effect size is exceptionally small, a difference may be statistically significant yet not meaningful in real-world applications.
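To make the distinction between significance and effect size concrete, the following sketch computes a two-sample t statistic that does not assume equal variances (Welch's form, a common choice when variances differ) together with Cohen's d. It is an illustration in pure Python rather than the SPSS procedure used in the study. The function names are hypothetical, and a p-value would additionally require the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for two independent samples without assuming equal variances."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)
```

A large sample can yield a large |t| even when `cohens_d` is close to zero, which is the situation the paragraph above warns about.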
RQ4. To address RQ4, which examines the extent to which inter-rater reliability and comparability vary across different LLMs, we conducted separate one-way analyses of variance (ANOVAs) for each dimension, with LLM type as the independent variable. Where overall differences were observed, we applied Bonferroni post-hoc tests to identify specific pairwise group differences. Additionally, we computed Krippendorff’s alpha to assess the inter-model reliability for each dimension, indicating the extent to which different LLMs converged on similar annotations.
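The ANOVA step can be illustrated with a short sketch that computes the one-way F statistic and the Bonferroni-adjusted significance threshold for all pairwise comparisons. This is a simplified stand-in for the SPSS analysis; the function names are hypothetical and the default significance level of 0.05 is an assumption.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of rating groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(n_groups, alpha=0.05):
    """Per-comparison threshold when testing every pair of groups."""
    n_pairs = n_groups * (n_groups - 1) // 2
    return alpha / n_pairs
```

With eight LLM variants there are 28 pairwise comparisons, so each post-hoc test would be judged against 0.05 / 28 under these assumptions.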
The comparative analysis between human annotators and large language models (LLMs) across the four dimensions of latent content analysis—sentiment, political leaning, emotional intensity, and sarcasm detection—revealed insightful findings about the inter-rater reliability and quality of LLMs in replicating human judgments.
Inter-Rater Reliability. To address RQ1, which focused on assessing the inter-rater reliability among LLMs and human annotators, Krippendorff’s alpha was calculated for both human annotators and LLMs in each dimension³⁷. In Sentiment Analysis, an alpha coefficient of 0.95 indicated a very high level of reliability among the 33 human annotators (Fig. 1). The narrow interquartile range (IQR) suggested minimal variability, highlighting consensus in evaluating sentiment. For Political Leaning, the alpha value was 0.55, reflecting moderate reliability with a broader IQR. This variability suggests differences in perceiving political leanings, possibly due to subjective interpretations or nuanced content. In Emotional Intensity, an alpha of 0.65 signified fair to good reliability among annotators. While reliability was better than for political leaning, the presence of variability indicated challenges in consistently assessing emotional intensity. In Sarcasm Detection, an alpha of 0.25 pointed to low reliability among annotators, with a wide IQR and outliers. This low consistency reflects the inherent difficulty of detecting sarcasm, even among human judges.
Krippendorff’s alpha values for inter-rater reliability among human annotators across four dimensions: sentiment, political leaning, emotional intensity, and sarcasm.
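For readers unfamiliar with the coefficient, the sketch below shows one way Krippendorff’s alpha can be computed as one minus the ratio of observed to expected disagreement. It assumes the interval distance metric (squared differences); this passage does not state which metric the study used, and the ratings are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """units: list of lists; each inner list holds the ratings one item received.

    Assumes the interval metric, i.e. disagreement = squared difference.
    """
    units = [u for u in units if len(u) >= 2]  # only items with 2+ ratings are pairable
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: squared differences between ratings of the same item
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences between all pooled ratings
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Hypothetical example: three items, three raters each (not study data)
ratings_per_item = [[1, 1, 2], [3, 3, 3], [5, 4, 5]]
alpha = krippendorff_alpha_interval(ratings_per_item)
print(f"alpha = {alpha:.3f}")
```

An alpha of 1 means raters agree perfectly, while values near 0 mean agreement is no better than chance, which is the range in which the sarcasm ratings fall.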
To assess the internal consistency of LLMs, the Krippendorff’s alpha values for Sentiment Analysis reached 0.95, matching human reliability levels and demonstrating consistency in sentiment evaluation (Fig. 2). For Political Leaning, an alpha of 0.80 indicated higher reliability among LLMs than human annotators, suggesting that LLMs may interpret political cues more uniformly. In Emotional Intensity, an alpha of 0.85 represented high reliability, again exceeding human consistency. This finding implies that LLMs were more consistent among themselves when assessing emotional intensity. Finally, in Sarcasm Detection, an alpha of 0.25, similar to that of humans, revealed low reliability among LLMs, highlighting the shared challenge in accurately detecting sarcasm. These inter-rater reliability results suggest that LLMs can achieve levels of consistency comparable to or exceeding those of human annotators in certain dimensions, particularly in sentiment analysis and emotional intensity.
Krippendorff’s alpha values for inter-rater reliability of LLMs across four dimensions: sentiment, political leaning, emotional intensity, and sarcasm.
ICCs. Most LLMs exhibited excellent temporal consistency across most of the dimensions, with the lowest values observed in sarcasm detection. The highest consistency was observed in the Sentiment Analysis dimension, where ICCs were consistently above 0.990 for all models. This indicates that the LLMs’ sentiment evaluations were highly stable over the three time points.
In the Political Leaning dimension, ICCs were slightly lower but still indicated high consistency, ranging from 0.965 to 0.997, with the exception of Gemini, which did not perform well. This suggests that most LLMs provided stable assessments of political leaning over time, despite the potential complexity involved in interpreting political content.
For Emotional Intensity, ICCs ranged from 0.966 to 0.996, demonstrating strong consistency among the LLMs. Although this dimension involves subjective interpretation of emotional expression, the LLMs maintained a high level of consistency in their ratings across time.
In the Sarcasm Detection dimension, ICCs were the lowest among the four dimensions. Most models showed moderate consistency, whereas GPT-4, Gemini, and Hard Prompt GPT-4o showed very low consistency. This mirrors the findings from RQ1, where both humans and LLMs showed low inter-rater reliability in sarcasm detection, suggesting inherent challenges in interpreting sarcasm.
The ICCs for each LLM across the four dimensions are presented in Table 2. To further examine the temporal stability of the LLMs, we calculated the mean standard deviation of ratings across the three time points for each LLM and dimension. The mean standard deviations were low across all LLMs and dimensions, further confirming the high temporal consistency of the models.
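The two stability measures mentioned above can be sketched as follows. The ICC form is an assumption: the code implements ICC(3,1) (two-way mixed effects, consistency, single measurement), since this passage does not specify which variant was used. The rating matrix is hypothetical, and the mean SD is computed per item across runs and then averaged, which is one plausible reading of the description.

```python
from statistics import mean, stdev


def icc_consistency(matrix):
    """ICC(3,1); matrix[i][t] is the rating of item i at time point t."""
    n, k = len(matrix), len(matrix[0])
    grand = mean(x for row in matrix for x in row)
    row_means = [mean(row) for row in matrix]
    col_means = [mean(row[t] for row in matrix) for t in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between time points
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def mean_sd_over_time(matrix):
    """Average, over items, of the SD of each item's ratings across runs."""
    return mean(stdev(row) for row in matrix)


# Hypothetical ratings of five items by one model at three time points
runs = [[4, 4, 4], [2, 2, 3], [5, 5, 5], [1, 1, 1], [3, 3, 3]]
icc = icc_consistency(runs)
mean_sd = mean_sd_over_time(runs)
print(f"ICC(3,1) = {icc:.3f}, mean SD = {mean_sd:.3f}")
```

A single changed rating out of fifteen already pulls the ICC below 1, which gives a sense of how little run-to-run variation values above 0.99 imply.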
T-tests. To evaluate the extent to which LLMs provide analysis comparable to human annotators (RQ3), independent sample t-tests were conducted for each dimension.
In sentiment analysis, the mean rating from human annotators was 3.19 (SD = 0.11), while LLMs had a mean rating of 3.22 (SD = 0.08). The t-test indicated no significant difference between the two groups, t (55) = -1.097, p = .277, with a small effect size (Cohen’s d = -0.29). This suggests that LLMs perform on par with humans in evaluating sentiment, providing ratings that are statistically and practically similar.
For political leaning, human annotators had a mean rating of 2.89 (SD = 0.25), and LLMs had a mean of 2.82 (SD = 0.14). The t-test, adjusted for unequal variances due to a significant Levene’s test, showed no significant difference, t (51.22) = 1.366, p = .178, with a small effect size (Cohen’s d = 0.33). This indicates that LLMs’ assessments of political leaning are comparable to those of human annotators, despite some variability.
In emotional intensity, human annotators’ mean rating was 3.44 (SD = 0.34), whereas LLMs had a mean of 3.19 (SD = 0.17). The t-test revealed a significant difference, t(49.42) = 3.615, p < .001, with a large effect size (Cohen’s d = 0.88). This significant disparity indicates that humans perceived and rated emotional intensity higher than LLMs, suggesting that LLMs may underrepresent the emotional nuances apparent to human annotators.
Regarding sarcasm detection, humans had a mean rating of 3.75 (SD = 0.50), and LLMs had a mean of 3.89 (SD = 0.51). The t-test showed no significant difference between the groups, t(55) = -1.002, p = .323, with a small effect size (Cohen’s d = -0.27). This result indicates that both humans and LLMs struggled with sarcasm detection, providing statistically similar but variable ratings.
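The test statistics above can be approximately recovered from the reported means and standard deviations. The sketch below assumes group sizes of 33 human annotators and 24 LLM observations (eight models at three time points); the second figure is inferred from the reported 55 degrees of freedom and is not stated in this passage. Because the inputs are rounded to two decimals, the results should land close to, but not exactly on, the published values.

```python
from math import sqrt


def pooled_t_and_d(m1, s1, n1, m2, s2, n2):
    """Student's t (equal variances assumed) and Cohen's d from summary statistics."""
    df = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    d = (m1 - m2) / pooled_sd
    t = d / sqrt(1 / n1 + 1 / n2)
    return t, df, d


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t and Satterthwaite degrees of freedom (unequal variances)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


N_HUMANS = 33    # from the study
N_LLM_OBS = 24   # assumption: 8 models x 3 time points

# Sentiment: humans 3.19 (SD 0.11) vs LLMs 3.22 (SD 0.08)
t_sent, df_sent, d_sent = pooled_t_and_d(3.19, 0.11, N_HUMANS, 3.22, 0.08, N_LLM_OBS)

# Emotional intensity: humans 3.44 (SD 0.34) vs LLMs 3.19 (SD 0.17), unequal variances
t_emo, df_emo = welch_t(3.44, 0.34, N_HUMANS, 3.19, 0.17, N_LLM_OBS)

print(f"sentiment: t({df_sent}) = {t_sent:.2f}, d = {d_sent:.2f}")
print(f"emotional intensity: t({df_emo:.2f}) = {t_emo:.2f}")
```

The Welch variant is used for emotional intensity because the reported fractional degrees of freedom indicate a correction for unequal variances, as with political leaning.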
ANOVAs. One-way ANOVAs were conducted for each dimension to assess differences among the nine groups (eight LLMs and human annotators).
In sentiment analysis, the ANOVA revealed no significant differences among groups, (F (8, 48) = 1.514, p = .177, η² = 0.20), indicating small effect size, and that mean sentiment ratings were consistent across humans and LLMs.
For political leaning, the ANOVA showed no significant differences, (F (8, 48) = 0.688, p = .700, η² = 0.10), suggesting that LLMs’ political leaning assessments did not significantly differ from human annotators or among themselves.
In emotional intensity, significant differences were found among groups, (F (8, 48) = 2.256, p = .039, η² = 0.27). Post hoc tests (Bonferroni correction) revealed that human ratings significantly differed from those of certain LLMs. Humans rated emotional intensity higher than GPT-3.5 (p < .001), Mixtral 8 × 7B (p = .004), and Gemini (p = .006). These findings indicate that some LLMs may consistently underreport emotional intensity compared to human annotators.
In sarcasm detection, the ANOVA showed marginal differences, (F (8, 48) = 2.126, p = .051, η² = 0.26), hinting at potential differences among groups. While specific significant differences were not conclusively identified, this suggests variability in how LLMs and humans detect sarcasm.
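In a one-way ANOVA, η² follows directly from the F statistic and its degrees of freedom, since η² = SS_between / (SS_between + SS_within). The short sketch below applies that identity to the four F values reported above as a consistency check; eta_squared_from_f is an illustrative helper name.

```python
def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta squared for a one-way ANOVA, derived from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# (F, reported eta squared) for the nine-group ANOVAs, all with df = (8, 48)
reported = {
    "sentiment": (1.514, 0.20),
    "political leaning": (0.688, 0.10),
    "emotional intensity": (2.256, 0.27),
    "sarcasm": (2.126, 0.26),
}

recomputed = {
    dim: eta_squared_from_f(f_stat, 8, 48) for dim, (f_stat, _) in reported.items()
}
for dim, value in recomputed.items():
    print(f"{dim}: eta squared = {value:.2f} (reported {reported[dim][1]:.2f})")
```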
Mean ratings. To identify which LLMs most closely replicate human performance, we compared the mean ratings of each model to human means across all dimensions (Table 3).
In sentiment analysis, the LLM GPT-4o-mini had a mean rating of 3.19, identical to the human mean. This exact match suggests that GPT-4o-mini provided sentiment evaluations highly aligned with human judgments. Other LLMs showed slight deviations; for example, GPT-3.5 had a higher mean of 3.35, indicating a tendency to rate sentiment more positively.
For political leaning, GPT-4’s mean rating of 2.90 was closest to the human mean of 2.89, demonstrating strong alignment. Other models, such as GPT-4o (M = 2.96) and Llama-3.1-70B (M = 2.81), also provided ratings similar to human assessments, though with minor differences.
In emotional intensity, GPT-4 again closely matched human ratings with a mean of 3.43, compared to the human mean of 3.44. This similarity suggests that GPT-4 is more adept at capturing emotional nuances than other LLMs, such as GPT-3.5 (M = 3.00), which rated emotional intensity lower than humans.
For sarcasm detection, the LLM Mixtral 8 × 7B had a mean rating of 3.95, closest to the human mean of 3.75. However, there was greater variability among LLMs in this dimension, with some models like GPT-4 providing higher mean ratings (4.36), indicating a tendency to perceive more sarcasm than human annotators.
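The comparison described in this subsection amounts to finding, for each dimension, the model whose mean has the smallest absolute distance to the human mean. The sketch below uses only the means quoted in the preceding paragraphs, so it is an incomplete stand-in for Table 3, and the dictionary layout is an illustrative choice.

```python
# Human means per dimension, as quoted in the text
human_means = {
    "sentiment": 3.19,
    "political leaning": 2.89,
    "emotional intensity": 3.44,
    "sarcasm": 3.75,
}

# Only the model means quoted in the text (the full set is in Table 3)
model_means = {
    "sentiment": {"GPT-4o-mini": 3.19, "GPT-3.5": 3.35},
    "political leaning": {"GPT-4": 2.90, "GPT-4o": 2.96, "Llama-3.1-70B": 2.81},
    "emotional intensity": {"GPT-4": 3.43, "GPT-3.5": 3.00},
    "sarcasm": {"Mixtral 8 × 7B": 3.95, "GPT-4": 4.36},
}


def closest_model(dimension):
    """Return the model whose mean is nearest to the human mean."""
    target = human_means[dimension]
    candidates = model_means[dimension]
    return min(candidates, key=lambda name: abs(candidates[name] - target))


closest = {dimension: closest_model(dimension) for dimension in human_means}
for dimension, name in closest.items():
    gap = abs(model_means[dimension][name] - human_means[dimension])
    print(f"{dimension}: {name} (gap {gap:.2f})")
```

Closeness of means says nothing about item-level agreement, so this check complements rather than replaces the reliability coefficients.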
ANOVAs. One-way ANOVAs were conducted for each dimension to assess differences between LLM types.
In sentiment analysis, the ANOVA revealed significant differences among groups, (F (7, 16) = 3.914, p = .011, η² = 0.63), indicating medium effect size and that mean sentiment ratings differed across LLMs. Bonferroni post-hoc tests showed that GPT-3.5-turbo-16k performs better in comparison to GPT-4o and GPT-4oH_Hard (p < .05).
For political leaning, the ANOVA showed no significant differences, (F (7, 16) = 1.871, p = .142, η² = 0.45), suggesting that LLMs’ political leaning assessments did not significantly differ by LLM type.
In emotional intensity, significant differences were found among groups, (F (7, 16) = 15.904, p < .001, η² = 0.87), with large effect size. Post hoc tests (Bonferroni) revealed that GPT-4 performs better than GPT-4oH_Hard, GPT-3.5, and Mixtral 8 × 7B (p < .001), GPT-4o performs better than GPT-3.5 (p = .023), GPT-4o-mini performs better than GPT-3.5 and GPT-4oH_Hard (p < .05), Llama-3.1-70B performs better than GPT-3.5, GPT-4oH_Hard and Mixtral 8 × 7B (p < .001).
In sarcasm detection, the ANOVA showed significant differences as well, (F (7, 16) = 3.111, p = .028, η² = 0.58), with medium effect size. However, Bonferroni post hoc tests did not reveal any significant pairwise differences. Only one comparison approached significance, GPT-4 performing better than GPT-3.5 (p = .069), which suggests that the sample may be too small to detect the small effect size of this difference.
Krippendorff’s alpha. Assessing the inter-rater reliability among different LLMs provides insights into their consistency and potential for reliable application (Fig. 3).
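A significant omnibus test without significant pairwise contrasts is a common outcome when many comparisons are corrected at once. The sketch below shows the Bonferroni adjustment under the assumption that the family consists of all pairwise comparisons among the eight models; the raw p-values are hypothetical and the model labels are shorthand.

```python
from itertools import combinations

models = [
    "GPT-3.5", "GPT-4", "GPT-4o", "GPT-4o-mini",
    "GPT-4oH_Hard", "Gemini", "Llama-3.1-70B", "Mixtral 8x7B",
]
pairs = list(combinations(models, 2))
n_comparisons = len(pairs)  # all unordered pairs of eight models


def bonferroni(p_raw, m):
    """Multiply the raw p-value by the number of comparisons, capped at 1."""
    return min(1.0, p_raw * m)


# Hypothetical unadjusted p-values for two contrasts (not study results)
raw_p = {("GPT-4", "GPT-3.5"): 0.0025, ("GPT-4", "Gemini"): 0.04}
adjusted = {pair: bonferroni(p, n_comparisons) for pair, p in raw_p.items()}

print(f"{n_comparisons} pairwise comparisons")
for pair, p in adjusted.items():
    print(pair, f"adjusted p = {p:.3f}")
```

Under this assumption a contrast needs an unadjusted p below roughly .0018 to remain significant at the .05 level, so moderate differences between models can easily fall short.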
For sentiment analysis, individual LLMs exhibited Krippendorff’s alpha values close to 1.00, indicating near-perfect reliability among models. This high consistency suggests that LLMs interpret sentiment remarkably similarly, regardless of architectural differences.
For political leaning, alpha values varied, with higher reliability among GPT-4 variants (alphas nearing 0.80) and lower reliability in models like Gemini (α below 0.00), which indicated less consistency. This variability suggests that some LLMs may interpret political cues differently.
In emotional intensity, GPT-4 variants showed high reliability (alphas around 0.85), while other models had moderate reliability. The consistent performance of GPT-4 models implies they may be better suited for tasks requiring sensitivity to emotional content.
In sarcasm detection, all LLMs exhibited low reliability (alphas around 0.25), mirroring the challenges faced by human annotators. The low internal consistency underscores the complexity of sarcasm detection across both human and machine interpretations.
Krippendorff’s alpha values for different Large Language Models (LLMs) across four dimensions: sentiment, political leaning, emotional intensity, and sarcasm.
The findings of this study illuminate the capabilities and limitations of large language models (LLMs) in performing latent content analysis tasks traditionally undertaken by human annotators. By comparing the performance of several state-of-the-art LLMs to that of human annotators across four dimensions—sentiment, political leaning, emotional intensity, and sarcasm detection—we gain insights into the potential for integrating these models into practical applications and the areas where human judgment remains indispensable.
This section offers formal answers to the research questions and discusses implications, connections to previous research, and practical insights. It also addresses limitations and suggests directions for future research.
RQ1: How reliably do LLMs compare to human annotators across multiple dimensions of latent content analysis? The study found that both humans and LLMs exhibit varying levels of inter-rater reliability across different dimensions of latent content analysis. In sentiment analysis, both groups demonstrated very high inter-rater reliability, with Krippendorff’s alpha values of 0.95 for both humans and LLMs, indicating strong consensus within each group. This suggests that sentiment is a dimension where both humans and LLMs can reliably assess content with minimal ambiguity.
Regarding political leaning, human annotators showed moderate reliability with an alpha value of 0.55, while LLMs exhibited higher consistency with an alpha of 0.80. The moderate reliability among humans may reflect subjective interpretations and individual biases in perceiving political leanings, whereas LLMs, relying on learned patterns from large datasets, provided more uniform assessments.
In the dimension of emotional intensity, human annotators achieved fair to good reliability (alpha = 0.65), whereas LLMs demonstrated higher consistency (alpha = 0.85). This indicates that while humans may differ in their perceptions of emotional intensity due to personal experiences and emotional intelligence, LLMs follow more standardized patterns in their evaluations.
For sarcasm detection, both humans and LLMs struggled, with low inter-rater reliability (alpha = 0.25 for both groups). This low reliability emphasizes the inherent difficulty in interpreting sarcasm, which often relies on context, tone, and cultural distinctions that are challenging to discern in written text alone.
RQ2: Are LLMs consistent over time when analyzing textual content? LLMs are mostly consistent over time when analyzing textual content, with lower consistency seen for sarcasm. By prompting each LLM to evaluate the same set of texts three times, the study observed minimal variation in their ratings across these repetitions. The low standard deviations in the LLMs’ responses indicate high intra-model consistency, meaning that most models produce stable and repeatable outputs upon re-evaluation of the same content.
These findings are notable given concerns in prior research about the temporal stability of LLM outputs³⁵,³⁶. Our results suggest that, when using standardized prompts and controlling for external variables, LLMs can provide reliable and consistent analyses over multiple instances, but not for all topics, and not all models. Some models are exceptionally low in consistency on some topics (e.g., Gemini for political leaning and sarcasm, and GPT-4 and Hard Prompt GPT-4o for sarcasm as well).
RQ3: To what extent do LLMs provide analysis that is comparable to human analysis in terms of quality? LLMs provide analysis that is comparable to human annotators in terms of quality for certain dimensions of latent content analysis. In sentiment analysis and political leaning, LLMs matched human performance closely, both in mean ratings and inter-rater reliability measures, indicating high-quality analysis. However, for dimensions such as emotional intensity and sarcasm detection, LLMs did not fully match human analysis. The significant differences in emotional intensity ratings and low agreement in sarcasm detection suggest that while LLMs are effective for tasks involving clear and direct language cues, they may not yet achieve the same level of quality as humans in interpreting complex or subtle textual elements.
RQ4: How do inter-rater reliability and comparability vary across different LLM models? The study revealed significant variations in inter-rater reliability and comparability across different LLM models.
While all LLMs demonstrated high reliability in sentiment analysis, with Krippendorff’s alpha values close to 1.00, indicating near-perfect consistency, differences emerged in other dimensions.
In political leaning assessments, models like GPT-4 and its variants showed higher reliability levels (alpha ≈ 0.80) compared to others like Gemini, which had lower reliability (alpha below 0.00). This suggests that certain models are more adept at interpreting political cues, possibly due to their training data and architecture.
For emotional intensity, GPT-4 displayed high reliability (alpha = 0.85), indicating its superior capacity to capture emotional complexities. In contrast, models like GPT-3.5 tended to rate emotional intensity lower than humans and other LLMs, highlighting variations in sensitivity to emotional content across models.
In sarcasm detection, all LLMs, regardless of the model, exhibited low reliability levels (alpha = 0.25), mirroring human difficulties. However, the mean ratings varied among models; for instance, GPT-4 tended to perceive more sarcasm (mean rating = 4.36) compared to models like Gemini (mean rating = 3.19). These differences suggest that certain LLMs may have a bias towards detecting or overestimating sarcasm.
The ANOVA analyses further supported these findings, showing significant differences among LLMs in sentiment analysis, emotional intensity, and sarcasm detection. Post hoc tests indicated that GPT-4 and its variants often performed better than other models, particularly in emotional intensity and sarcasm detection, although the differences in sarcasm were not always statistically significant.
One important factor underlying the variations observed in LLM performance is the distinct architecture and training data employed by each model. For instance, GPT-4 and its variants (GPT-4o, GPT-4o-mini, Hard Prompt GPT-4o) benefit from a larger parameter space and a more recent training corpus, which likely contribute to their strong performance in areas requiring complex understanding, such as emotional intensity. GPT-4’s use of extensive reinforcement learning from human feedback (RLHF) can also refine its ability to interpret subtle linguistic cues more reliably. By contrast, GPT-3.5—though still highly capable—has fewer parameters and is trained on an earlier data cut, which may explain notable discrepancies in capturing emotional depth or sarcastic undertones.
Other models, such as Gemini, Llama-3.1-70B, and Mixtral 8 × 7B, may vary in performance because of differing training objectives, tokenization strategies, or domain-specific pretraining. For example, Llama-3.1-70B’s performance in political leaning tasks can be partially attributed to the structure of its training data, which may place more emphasis on political discourse or formal text, thereby leading to more consistent interpretations. Gemini and Mixtral 8 × 7B, meanwhile, appear more susceptible to underreporting emotional intensities or overestimating sarcasm, perhaps due to less exposure to the rich, context-driven examples that tend to fine-tune such skills.
These variations imply that the choice of LLM significantly impacts the inter-rater reliability and quality of content analysis. Practitioners should carefully select LLMs based on the specific dimension of analysis and consider combining outputs from multiple models or integrating human oversight for dimensions where models show less agreement or diverge significantly from human judgments.
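One way to operationalize this recommendation is to accept a consensus score when models agree and route an item to human annotators when they do not. The following sketch is purely illustrative: the triage function, the spread threshold of 0.5, and the ratings are hypothetical and were not part of this study.

```python
from statistics import mean, pstdev


def triage(ratings_by_item, max_spread=0.5):
    """Split items into auto-accepted consensus scores and items needing review.

    max_spread is a hypothetical cut-off on the standard deviation of the
    model ratings for one item; it would need tuning per dimension.
    """
    accepted, needs_review = {}, []
    for item_id, ratings in ratings_by_item.items():
        if pstdev(ratings) <= max_spread:
            accepted[item_id] = mean(ratings)   # models agree: use their mean
        else:
            needs_review.append(item_id)        # models diverge: ask a human
    return accepted, needs_review


# Hypothetical ratings of three items by four models
ratings_by_item = {
    "item_01": [4, 4, 5, 4],
    "item_02": [1, 5, 2, 4],
    "item_03": [3, 3, 3, 3],
}
accepted, needs_review = triage(ratings_by_item)
print("accepted:", accepted)
print("needs human review:", needs_review)
```

A stricter threshold would be a natural choice for dimensions such as sarcasm, where agreement among models was low.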
The study’s findings indicate that LLMs can perform latent content analysis tasks with inter-rater reliability and quality comparable to human annotators in certain dimensions. In sentiment analysis and political leaning, LLMs provided ratings comparable to humans, with no significant differences and high consistency. GPT-4 and its variants, in particular, showed performance closely aligning with human judgments in these dimensions.
In emotional intensity, although GPT-4 closely matched human ratings, significant differences existed between humans and some LLMs, with humans rating emotional intensity higher. This suggests that while LLMs can approximate human assessments, they may underrepresent emotional distinctions.
In sarcasm detection, both humans and LLMs faced significant challenges, evidenced by low inter-rater reliability. The inherent complexity of sarcasm likely contributes to this difficulty, indicating a need for further advancements in computational models and methodologies to better capture this aspect of language.
The results suggest that advanced LLMs, particularly GPT-4, have the potential to serve as reliable substitutes for human annotators in latent content analysis tasks such as sentiment analysis and political leaning assessments. This capability could greatly enhance the efficiency of analyzing large volumes of textual data, supporting applications in social media monitoring, market research, and political analysis.
However, the findings also highlight limitations of LLMs in accurately assessing emotional intensity and detecting sarcasm. These dimensions involve complex human emotions and contextual subtleties that current LLMs may not fully capture. Therefore, human expertise remains crucial in these areas, and a hybrid approach combining LLM efficiency with human judgment may be most effective.
The results align with prior studies indicating that LLMs have achieved proficiency in sentiment analysis and can rival human performance in certain tasks²⁸,²⁹.
The high consistency and comparable mean ratings in sentiment analysis and political leaning suggest that LLMs have matured in their ability to interpret affective language. Previous literature has noted that LLMs can adopt biases present in training data³², which could influence their interpretations of political content.
However, the significant difference in emotional intensity ratings, with humans providing higher scores than LLMs, indicates that LLMs may underrepresent the depth of emotional content perceived by humans. This finding resonates with earlier research highlighting challenges in computational models capturing nuanced emotional expressions¹³. It suggests that while LLMs can detect the presence of emotion, they may struggle with assessing its magnitude accurately.
The findings have practical implications for various fields that rely on content analysis. In domains such as marketing, social media monitoring, and public opinion research, LLMs could serve as efficient tools for sentiment analysis, reducing the need for extensive human annotation and accelerating data processing times.
For political analysis, LLMs can assist in aggregating and evaluating large datasets to identify trends and shifts in public discourse. However, practitioners should be cautious of potential biases and consider combining LLM outputs with human oversight to ensure correct interpretations are captured.
The challenges identified in emotional intensity and sarcasm detection suggest that human expertise remains crucial in these areas. Applications requiring deep emotional understanding, such as mental health assessments or customer experience analysis, may benefit from a hybrid approach that leverages LLM efficiency while incorporating human judgment for depth and accuracy.
Several limitations of this study warrant consideration. The sample size of human annotators, while sufficient for statistical analysis, may not capture the full diversity of human interpretations influenced by cultural, social, and individual differences. Cultural differences have been argued to operate not only at the individual level but also at the group level, for example in organizational cultures⁵⁰. Human biases inevitably shape annotation outcomes, particularly because annotators’ backgrounds (such as cultural norms, beliefs, and disciplinary training) can influence their interpretation of textual cues. For instance, what one individual perceives as sarcastic or highly emotional may differ in another cultural context. To mitigate these biases, future research could employ a more diverse pool of annotators, spanning multiple linguistic and cultural backgrounds. Structured calibrations and follow-up discussions (e.g., focus groups) among annotators could help surface differing frame-of-reference issues and improve consensus. It is also advisable to implement standardized bias detection and de-biasing protocols, such as iterative feedback loops where annotators revisit items after group discussion, in order to minimize subjective skew and increase reliability.
The selection of textual items, though diverse, may not encompass the entire spectrum of complexity found in natural language. Certain texts might inherently favor LLM processing due to their structure or content, while others may present challenges not fully represented in the sample.
Finally, the study relied on the current versions of the selected LLMs. As these models are continually updated and fine-tuned, their performance may evolve. Future research should consider longitudinal studies to assess how LLM capabilities change over time.
Based on the findings, several future research directions emerge, inviting further exploration and innovation in key areas of machine learning and artificial intelligence.
One promising avenue is enhancing emotional understanding in large language models (LLMs), with a focus on boosting sensitivity to emotional intensity. This might be achieved by training models on specialized datasets designed to capture emotional depth and diversity.
Another critical area is the advancement of sarcasm detection techniques. To tackle the complexities and context-dependent nature of sarcasm, researchers could develop sophisticated models or integrate multimodal data, such as contextual metadata and user profiles.
Future studies should investigate intrinsic LLM biases and devise methods to mitigate them, ensuring interpretations are both fair and accurate.
Expanding research to include diverse cultural contexts can enhance LLM performance. By incorporating texts and contributions from a variety of cultural backgrounds, researchers can evaluate how LLMs operate across different linguistic and cultural landscapes, facilitating a more inclusive analysis.
One promising direction for future research is to extend the current framework to more specialized domains, such as mental health. For instance, future studies could investigate how LLMs perform when provided with domain-specific guidelines or clinical practice standards, as suggested by recent work⁵¹. Incorporating specialized diagnostic criteria or knowledge infusion techniques⁵² could enhance the LLMs’ ability to detect subtle cues in mental health-related texts, ultimately broadening the applicability of latent content analysis beyond conventional sentiment and opinion studies.
The use of different prompting strategies, such as chain of thought or few-shot prompting⁵³, to improve analysis results is another promising area for future exploration. Additionally, it could be interesting to evaluate the success of models with additional training for specific domain analysis, such as in evidence-based scientific claim verification⁵⁴.
Integration of various strategies presents the potential for hybrid models that leverage LLM efficiency alongside human oversight. Such an approach could lead to more effective outcomes, balancing computational strengths with human intuition and judgment.
The dataset, human annotations and grading form with instructions, and LLM prompts are available in the Open Science Framework repository: https://osf.io/a43pj/?view_only=829339c653774ebb86123ca99b6551f5
The code repository used for the experiments in this study is available at GitHub: https://github.com/zagovora/LLM-Laten-Content-Analysis/
The open-source models utilized in this study can be found at the following links:
Llama-3.1-70B: https://huggingface.co/meta-llama/Llama-3.1-70B
Mixtral-8x7B-v0.1: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
The original online version of this Article was revised: The first sentence in the acknowledgements section in the original version of this Article was incorrect. It now reads: “This article/publication is based upon work from COST Action What are Opinions? Integrating Theory and Methods for Automatically Analyzing Opinionated Communication (OPINION), CA21129, supported by COST (European Cooperation in Science and Technology).”
Neuendorf, K. A. The Content Analysis Guidebook. (SAGE Publications, 2017). https://doi.org/10.4135/9781071802878
Liu, B. Sentiment Analysis and Opinion Mining (Springer, 2012). https://doi.org/10.1007/978-3-031-02145-9
DiMaggio, P., Evans, J. & Bryson, B. Have Americans’ social attitudes become more polarized? Am. J. Sociol. 102, 690–755. https://doi.org/10.1086/230995 (1996).
Pang, B. & Lee, L. Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2, 1–135. https://doi.org/10.1561/1500000011 (2008).
Ghosh, D., Fabbri, A. R. & Muresan, S. Sarcasm analysis using conversation context. Comput. Linguist. 44, 755–792. https://doi.org/10.1162/coli_a_00336 (2018).
Bojic, L., Kovacevic, P. & Cabarkapa, M. GPT-4 surpassing human performance in linguistic pragmatics. Preprint at http://arxiv.org/abs/2312.09545 (2023).
Krippendorff, K. Content Analysis: An Introduction to Its Methodology (SAGE, 2019). https://doi.org/10.4135/9781071878781
Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47. https://doi.org/10.1145/505282.505283 (2002).
Pang, B., Lee, L. & Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. In Proc. ACL-02 Conf. Empir. Methods Nat. Lang. Process. 10, 79–86. https://doi.org/10.3115/1118693.1118704 (2002).
Mikolov, T., Chen, K., Corrado, G. S. & Dean, J. Efficient estimation of word representations in vector space. Preprint at http://arxiv.org/abs/1301.3781 (2013).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. Empirical Methods Nat. Lang. Process. 1631–1642 (2013). https://aclanthology.org/D13-1170
Poria, S., Cambria, E., Hazarika, D. & Vij, P. A deeper look into sarcastic tweets using deep convolutional neural networks. Preprint at http://arxiv.org/abs/1610.08815 (2017).
Thareja, R. Multimodal sentiment analysis of social media content and its impact on mental wellbeing: An investigation of extreme sentiments. In Proc. 7th Joint Int. Conf. Data Sci. Manage. Data (CODS-COMAD) 469–473. https://doi.org/10.1145/3632410.3632462 (2024).
Li, H., Lu, Y. & Zhu, H. Multi-modal sentiment analysis based on image and text fusion based on cross-attention mechanism. Electronics 13, 2069. https://doi.org/10.3390/electronics13112069 (2024).
Akhtar, M. S., Chauhan, D. S. & Ekbal, A. A deep multi-task contextual attention framework for multi-modal affect analysis. ACM Trans. Knowl. Discov. Data. 14, 1–27. https://doi.org/10.1145/3380744 (2020).
Cheng, Y., Li, K. & Kang, Z. EMKG: Efficient matchings for knowledge graph integration in stance detection. In Joint Conf. Neural Netw. (IJCNN) 1–8 (2024). https://doi.org/10.1109/IJCNN60899.2024.10651163
Guo, Q. et al. BANER: Boundary-aware LLMs for few-shot named entity recognition. In Proc. 31st Int. Conf. Comput. Linguist. 10375–10389. https://aclanthology.org/2025.coling-main.691/ (2025).
Joshi, A., Bhattacharyya, P. & Carman, M. J. Automatic sarcasm detection: A survey. ACM Comput. Surv. 50, 1–22. https://doi.org/10.1145/3124420 (2018).
Baruah, A., Das, K., Barbhuiya, F. & Dey, K. Context-aware sarcasm detection using BERT. In Proc. 2nd Workshop Figurative Lang. Process. 83–87. https://doi.org/10.18653/v1/2020.figlang-1.12 (2020).
Oprea, S. & Magdy, W. iSarcasm: A dataset of intended sarcasm. In Proc. 58th Annu. Meet Assoc. Comput. Linguist. 1279–1289. https://doi.org/10.18653/v1/2020.acl-main.118 (2020).
Vaswani, A. et al. Attention is all you need. In Proc. 31st Int. Conf. Neural Inf. Process. Syst. 6000–6010. https://doi.org/10.5555/3295222.3295349 (2017).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conf. North. Am. Chapter Assoc. Comput. Linguist. 4171–4186. https://doi.org/10.18653/v1/N19-1423 (2019).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. Preprint at http://arxiv.org/abs/1907.11692 (2019).
Zhu, X., Kang, Z. & Hui, B. FCDS: Fusing constituency and dependency syntax into document-level relation extraction. In Proc. 2024 Joint Int. Conf. Comput. Linguist Lang. Resour. Eval (LREC-COLING) 7141–7152. https://aclanthology.org/2024.lrec-main.627/ (2024).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog (2019). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020). https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Chang, T. A. & Bergen, B. K. Language model behavior: A comprehensive survey. Comput. Linguist. 50, 293–350. https://doi.org/10.1162/coli_a_00492 (2024).
Floridi, L. & Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 30, 681–694. https://doi.org/10.1007/s11023-020-09548-1 (2020).
Kawintiranon, K. & Singh, L. PoliBERTweet: A pre-trained language model for analyzing political content on Twitter. In Proc. 13th Lang. Resour. Eval. Conf. 7360–7367 https://aclanthology.org/2022.lrec-1.801 (2022).
Cambria, E., Li, Y., Xing, F. Z., Poria, S. & Kwok, K. SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis. In Proc. 29th ACM Int. Conf. Inf. Knowl. Manage. 105–114 (2020). https://doi.org/10.1145/3340531.3412003
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proc. 2021 ACM Conf. Fairness, Accountability, Transparency 610–623 (2021). https://doi.org/10.1145/3442188.3445922
Bodroža, B., Dinić, B. M. & Bojić, L. Personality testing of large language models: Limited temporal stability, but highlighted prosociality. R. Soc. Open. Sci. 11, 240180. https://doi.org/10.1098/rsos.240180 (2024).
Zhang, D., Rayz, J. & Pradhan, R. Counteracts: Testing stereotypical representation in pre-trained language models. Preprint at http://arxiv.org/abs/2301.04347 (2023).
Imamguluyev, R. The rise of GPT-3: Implications for natural language processing and beyond. Int. J. Res. Publ Rev. 4, 4893–4903. https://doi.org/10.55248/gengpi.2023.4.33987 (2023).
Gao, T., Fisch, A. & Chen, D. Making pre-trained language models better few-shot learners. In Proc. 59th Annu. Meet. Assoc. Comput. Linguist. 3816–3830 (2021). https://doi.org/10.18653/v1/2021.acl-long.295
Hayes, A. F. & Krippendorff, K. Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1, 77–89. https://doi.org/10.1080/19312450709336664 (2007).
Lottridge, S., Ormerod, C. & Jafari, A. Psychometric considerations when using deep learning for automated scoring. In Adv. Nat. Lang. Process. Educ. Assess. 15 (2023).
OPINION. Website of the COST Action network CA21129 – What are opinions? Integrating theory and methods for automatically analyzing opinionated communication. https://www.opinion-network.eu/ (2024).
OSF. LLM as a tool for textual analysis. Open. Sci. Framew. (2024). https://osf.io/a43pj/
Go, A., Bhayani, R. & Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Rep. Stanf. Univ. 1–12 (2009). https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
Iyyer, M., Enns, P., Boyd-Graber, J. & Resnik, P. Political ideology detection using recursive neural networks. In Proc. 52nd Annu. Meet Assoc. Comput. Linguist. 1113–1122. https://doi.org/10.3115/v1/P14-1105 (2014).
Buechel, S. & Hahn, U. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. Proc. 15th Conf. Eur. Chapter Assoc. Comput. Linguist. 2, 578–585. https://doi.org/10.18653/v1/E17-2092 (2017).
Khodak, M., Saunshi, N. & Vodrahalli, K. A large self-annotated corpus for sarcasm. Preprint at http://arxiv.org/abs/1704.05579 (2018).
Pang, B. & Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. 43rd Annu. Meet Assoc. Comput. Linguist. 115–124. https://doi.org/10.3115/1219840.1219855 (2005).
Riloff, E. et al. Sarcasm as contrast between a positive sentiment and negative situation. In Proc. 2013 Conf. Empirical Methods Nat. Lang. Process. 704–714 (2013).
Lombard, M., Snyder-Duch, J. & Bracken, C. C. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Hum. Commun. Res. 28, 587–604. https://doi.org/10.1111/j.1468-2958.2002.tb00826.x (2002).
Shrout, P. E. & Fleiss, J. L. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 420–428. https://doi.org/10.1037/0033-2909.86.2.420 (1979).
Field, A. Discovering Statistics Using SPSS (SAGE, 2009).
Hofstede, G. Dimensionalizing cultures: The Hofstede model in context. Online Readings Psychol. Cult. 2, 8. https://doi.org/10.9707/2307-0919.1014 (2011).
Dalal, S., Jain, S. & Dave, M. Deep knowledge-infusion for explainable depression detection. Preprint at https://doi.org/10.48550/arXiv.2409.02122 (2024).
Dalal, S. et al. A cross attention approach to diagnostic explainability using clinical practice guidelines for depression. IEEE J. Biomed. Health Inf. https://doi.org/10.1109/JBHI.2024.3483577 (2024).
Singh, M., Kumar, S. & Singh, S. R. ClusterCore at SemEval-2024 Task 7: Few-shot prompting with large language models for numeral-aware headline generation. In Proc. 18th Int. Workshop Semantic Eval. (SemEval-2024) https://doi.org/10.18653/v1/2024.semeval-1.246 (2024).
Kumar, S. et al. SciClaimHunt: A large dataset for evidence-based scientific claim verification. Preprint at https://doi.org/10.48550/arXiv.2502.10003 (2025).
This article/publication is based upon work from COST Action What are Opinions? Integrating Theory and Methods for Automatically Analyzing Opinionated Communication (OPINION), CA21129, supported by COST (European Cooperation in Science and Technology). We would like to express our deepest gratitude to the Action Chair, Christian Baden, for his exceptional leadership and guidance throughout this project. We also extend our sincere thanks to the esteemed members of the leadership team (Carlos Arcila Calderón, Mariken Van Der Velden, Nina Springer, and Aysen Şimsek) for their invaluable support and contributions.
Our heartfelt appreciation goes to the following members of the COST Action who contributed significantly to this paper through their expert analysis: Aleksandar Tomašević, Jelena Gledić, Lenka Vochocová, Jaromír Mazák, Anela Mulahmetović Ibrišimović, Jana Rosenfeldová, Barbara Lewandowska-Tomaszczyk, Ledia Kazazi, Alba Haveriku, Alper Özcan, Anna Baczkowska, Ema Kristo, Nelda Kote, Bruno Yun, Carlos Alberto Cunha, Celia Yuen Sze Tsui, Raluca Buturoiu, José Santana Pereira, Julia Gottstein, Damian Guzek, Sjoerd B. Stolwijk, Melis Yazici, Marc Jungblut, Louis Escouflaire, Dimitra Milioni, Kinga Adamczewska, Shubin Yu, Roksana Gloc, Annie Waldherr, Laura Teruel, and Ana Milojević. Their expertise and dedication were instrumental to the success of this study.
Many thanks go to Olja Šojić, Bojana Dinić, Nenad Pantelić, and Miloš Agatonović for their significant assistance with the analysis. Their expertise and dedication were instrumental in the completion of this study.
This work has been supported by the Short-Term Scientific Mission (STSM) Grant: LLMs as a Tool for Textual Analysis (E-COST-GRANT-CA21129-7d62eb40). The productive discussions, constructive criticism, and resources from the COST Action Network CA21129: What are Opinions? Integrating Theory and Methods for Automatically Analyzing Opinionated Communication (OPINION) have significantly contributed to the quality of this paper. For more information about this initiative, visit https://www.opinion-network.eu/.
This study was realised with the support of the Ministry of Science, Technological Development and Innovation of the Republic of Serbia, according to the Agreement on the realisation and financing of scientific research 451-03-136/2025-03/ 200025 and the Agreement on the realisation and financing of scientific research 451-03-136/2025-03. This paper has been supported by TWON (project number 101095095), a research project funded by the European Union under the Horizon Europe framework (HORIZON-CL2-2022-DEMOCRACY-01, topic 07). More details about the project can be found on its official website: https://www.twon-project.eu/. We would like to express our gratitude for the support provided by the SDG Policy Metrics Project, a collaborative effort between the Institute for Artificial Intelligence Research and Development of Serbia and the Ministry of Science, Technological Development and Innovation of the Republic of Serbia.
Institute for Artificial Intelligence Research and Development of Serbia, Fruskogorska, Novi Sad, Serbia
Ljubiša Bojić
Institute for Philosophy and Social Theory, Digital Society Lab, University of Belgrade, Kraljice Natalije 45, Belgrade, 11000, Serbia
Ljubiša Bojić
Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau (RPTU), Fortstraße 7, 76829, Landau, Germany
Olga Zagovora
German Research Centre for Artificial Intelligence (DFKI), Trippstadter Str. 122, 67663, Kaiserslautern, Germany
Olga Zagovora
Vilnius Gediminas Technical University, Saulėtekio al. 11, 10223, Vilnius, Lithuania
Asta Zelenkauskaite
Drexel University, 3201 Arch, 165, Philadelphia, PA, 19104, USA
Asta Zelenkauskaite
Faculty of Dramatic Arts, University of Montenegro, Bajova 6, 81250, Cetinje, Montenegro
Vuk Vuković
Faculty of Engineering, University of Kragujevac, Kragujevac, 34000, Serbia
Milan Čabarkapa
Faculty of Humanities and Social Sciences, Department of English Language and Literature, University of Tuzla, Tuzla, Bosnia and Herzegovina
Selma Veseljević Jerković
Faculty of Education and Health Sciences, Department of Psychology, University of Limerick, National Technological Park Limerick, Limerick, V94 T9PX, Ireland
Ana Jovančević
L.B. and O.Z. conceptualized the study and designed the methodology. L.B. conducted the experiments and data analysis with assistance from M.C. and V.V. O.Z. and A.Z. contributed to the interpretation of the results. A.J. and S.V.J. prepared the initial draft of the manuscript with input from L.B. and M.C. V.V. and A.Z. refined and revised the manuscript critically for important intellectual content. All authors, including L.B., O.Z., A.Z., V.V., M.C., S.V.J., and A.J., reviewed and approved the final version of the manuscript for submission.
Correspondence to Ljubiša Bojić.
Because this study included human participants, ethical approval (No. 20032024) was obtained from the Ethics Committee of the Institute for Artificial Intelligence Research and Development of Serbia. All methods were carried out in accordance with relevant guidelines and regulations.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Bojić, L., Zagovora, O., Zelenkauskaite, A. et al. Comparing large Language models and human annotators in latent content analysis of sentiment, political leaning, emotional intensity and sarcasm. Sci Rep 15, 11477 (2025). https://doi.org/10.1038/s41598-025-96508-3
Received: 05 December 2024
Accepted: 28 March 2025
Published: 03 April 2025
Version of record: 03 April 2025
DOI: https://doi.org/10.1038/s41598-025-96508-3
Scientific Reports (Sci Rep)
ISSN 2045-2322 (online)
© 2026 Springer Nature Limited