AI emotional intelligence surpasses human average in rigorous psychological trials
Emotional intelligence, especially in the form of ability EI, includes recognizing, understanding, managing, and reasoning about emotions. These skills are foundational for human communication, workplace harmony, and mental well-being. Given the surge in affective computing and emotionally responsive AI agents, this study seeks to clarify whether LLMs are truly equipped for emotional reasoning.

A groundbreaking peer-reviewed study published in Communications Psychology under the title “Large language models are proficient in solving and creating emotional intelligence tests” has shown that leading large language models (LLMs) including ChatGPT-4 can not only outperform humans on emotional intelligence (EI) assessments but also generate valid EI test items.
This research, conducted by Katja Schlegel, Nils R. Sommer, and Marcello Mortillaro, evaluated six LLMs across five standardized EI tests and conducted rigorous psychometric comparisons between original and LLM-generated versions of these tests.
Can AI demonstrate true emotional intelligence?
The study explores whether AI models possess the ability to reason about emotions in a manner comparable to humans.
To investigate, the researchers tested ChatGPT-4, ChatGPT-o1, Copilot 365, Claude 3.5 Haiku, Gemini 1.5 Flash, and DeepSeek V3 using five validated EI tests: the Situational Test of Emotion Management (STEM), the Situational Test of Emotion Understanding (STEU), the Geneva Emotion Knowledge Test - Blends (GEMOK-Blends), and two subtests from the Geneva Emotional Competence Test (GECo) on regulation and management. Across the board, all six LLMs outperformed human benchmarks with a mean score of 81% accuracy versus the 56% average human score reported in the original validations.
The relationship between item difficulty and LLM performance mirrored human patterns - items that were easy for humans were also more likely to be answered correctly by the LLMs. This suggests that the models may rely on the same item cues that guide human reasoning.
Can LLMs also design emotionally intelligent tests?
Beyond solving tests, the researchers asked whether LLMs could create EI assessments comparable in quality to those developed by psychologists. In this second part, ChatGPT-4 was used to generate test items for each of the five EI assessments. These AI-generated versions were then subjected to large-scale validation with 467 participants across five separate studies.
The generated tests were evaluated against several psychometric benchmarks: clarity, realism, item content diversity, internal consistency (using Cronbach’s alpha and item-total correlations), and correlations with external benchmarks like a vocabulary test and a different EI test.
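Cronbach's alpha, one of the internal-consistency statistics named above, can be sketched in a few lines; the response matrix below is fabricated for illustration and is not data from the study:

```python
# Illustrative sketch of Cronbach's alpha on a fabricated response matrix
# (rows = respondents, columns = test items, answers scored 0/1).

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    items = list(zip(*rows))           # column-wise item scores
    totals = [sum(r) for r in rows]    # each respondent's total score
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(responses), 2))  # -> 0.83
```

Values in roughly the 0.7-0.9 range are conventionally read as acceptable internal consistency, which is the kind of benchmark the AI-generated tests were held to.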
Key findings included:
- Statistical Equivalence in Difficulty: The original and ChatGPT-generated tests showed equivalent difficulty, confirming that ChatGPT-4 can generate assessments of similar complexity.
- Clarity and Realism: While clarity was comparable, realism ratings were slightly higher for ChatGPT-generated tests. These small differences suggest that AI can create plausible and intelligible emotional scenarios.
- Content Diversity: Participants categorized original test scenarios into more thematic groups, suggesting greater variety in original items compared to AI-generated ones. This reflects a limitation in ChatGPT’s creative variability.
- Construct Validity: Correlations with other ability EI tests and with a vocabulary test were slightly weaker for ChatGPT-generated versions, though the differences remained within small effect sizes.
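"Statistical equivalence" is the kind of claim usually supported by an equivalence test such as TOST; the study's exact procedure is not reproduced here, and the sketch below uses the confidence-interval version of TOST with a normal approximation, invented data, and an arbitrary margin:

```python
# Hedged sketch: two mean difficulties are declared "equivalent" when a 90%
# confidence interval for their difference lies entirely inside a pre-chosen
# margin (the CI formulation of the TOST procedure, normal approximation).
# Data and margin below are invented for illustration.

import math

def equivalent(a, b, margin, z=1.645):  # z = 1.645 -> 90% CI, as in TOST
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)               # SE of the mean difference
    diff = ma - mb
    lo, hi = diff - z * se, diff + z * se
    return -margin < lo and hi < margin

# Fabricated per-item difficulties for original vs. generated test versions.
original  = [0.55, 0.60, 0.62, 0.58, 0.57, 0.61, 0.59, 0.60]
generated = [0.57, 0.59, 0.61, 0.60, 0.58, 0.62, 0.56, 0.60]

print(equivalent(original, generated, margin=0.05))  # -> True
```

The design choice worth noting: unlike an ordinary significance test, which can only fail to find a difference, an equivalence test lets one positively conclude that any difficulty difference is smaller than a stated margin.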
Despite these subtle gaps, the overall correlation between the original and ChatGPT-created tests was moderately strong (r = 0.46), indicating that the two versions measure closely related constructs. Moreover, ChatGPT-4 successfully met strict formal requirements for each test format, such as aligning response options with correct emotional regulation strategies or generating blended emotion scenarios.
What are the implications for future AI-driven emotional assessments?
The study's implications extend well beyond academic testing. It positions ChatGPT-4 and similar LLMs as viable tools for supporting emotionally intelligent interactions in sectors like healthcare, customer service, education, and HR. In environments requiring empathy and emotion regulation, these models may provide stable, unbiased performance where human capabilities fluctuate.
One advantage highlighted is LLMs’ capacity for maximal performance unaffected by mood, fatigue, or stress - unlike humans, who may show “motivated inaccuracy” in sensitive contexts. Additionally, LLMs can maintain a consistently high standard in interpreting and managing emotional content, which is vital in emotionally charged human-AI interactions.
The researchers also underline ChatGPT-4’s potential role in psychometric development. Traditionally, designing EI assessments is resource-intensive, involving qualitative research, pilot testing, and psychometric validation. ChatGPT-4 was able to produce structured tests through a handful of prompts, significantly accelerating early test development stages. However, the authors caution that final test quality still relies on expert validation to weed out poorly performing items.
Despite the promising outcomes, limitations remain. The study was conducted using Western-centric cultural norms embedded in both test construction and LLM training data. Emotional expression and interpretation differ across cultures, and current LLMs may not fully adapt to these variances. Additionally, the black-box nature of LLMs raises concerns about explainability, consistency across model versions, and alignment with real-world complexities.
First published in: Devdiscourse