Can AI predict the market? LLMs put to the test in financial sentiment study

CO-EDP, VisionRI | Updated: 26-05-2025 09:49 IST | Created: 26-05-2025 09:49 IST

As generative AI systems make deeper inroads into finance, a new study offers a timely benchmark of how well large language models (LLMs) can interpret nuanced financial sentiment - often coded in vague, hedged, or deliberately strategic language. The paper, titled “Can AI Read Between the Lines? Benchmarking LLMs on Financial Nuance”, was published on arXiv by researchers from Santa Clara University. It presents one of the most rigorous comparisons to date of sentiment analysis capabilities across top LLMs, traditional NLP tools, and enterprise AI platforms, using Microsoft earnings calls and a financial phrase dataset as the testbed.

The study is the result of a Microsoft-sponsored capstone project and evaluates the accuracy, utility, and limitations of LLMs such as Microsoft Copilot, OpenAI’s ChatGPT-4o, and Google’s Gemini 2.0 Flash. It combines benchmarking, prompt engineering, and real-world stock performance correlation to explore the extent to which AI tools can serve as reliable assistants for financial analysts.

How accurately can LLMs decode financial tone?

The research team evaluated nine different tools, including LLM-based platforms, traditional sentiment libraries, and cloud services. Using the Financial Phrase Bank dataset - a well-established benchmark of market-labeled financial statements - they compared each model’s sentiment classifications to ground truth labels.
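The benchmarking step described above amounts to scoring each tool's labels against the dataset's gold labels. The sketch below illustrates that loop; the sample sentences, the `classify()` stub, and the label names are assumptions for illustration, not the paper's actual pipeline.

```python
# Toy version of the accuracy benchmark: compare predicted sentiment
# labels against Financial PhraseBank-style gold labels.

def accuracy(predictions, gold_labels):
    """Fraction of sentences where the predicted label matches the gold label."""
    assert len(predictions) == len(gold_labels)
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# (sentence, gold label) records in the style of the Financial Phrase Bank
samples = [
    ("Operating profit rose to EUR 13.1 mn from EUR 8.7 mn.", "positive"),
    ("The company expects net sales to remain flat.", "neutral"),
    ("Net sales decreased by 10.5 % year-on-year.", "negative"),
]

def classify(sentence):
    # Placeholder for a call to an LLM or sentiment library that
    # returns "positive" / "neutral" / "negative".
    return "neutral"

preds = [classify(s) for s, _ in samples]
gold = [label for _, label in samples]
print(f"accuracy = {accuracy(preds, gold):.1%}")
```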

Microsoft’s Copilot (Online and Local versions) delivered the best results with 82% accuracy, followed by ChatGPT-4o at 77.6% and Gemini at 68%. In contrast, traditional tools like NLTK and Azure Language AI achieved only 48.8% and 54%, respectively. Notably, even prompt engineering did not help configurations such as Copilot 365, which fell back to TextBlob for scoring and remained stuck at around 45% accuracy.
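To see why lexicon-based tools plateau near 50% on financial text, consider a minimal word-list scorer in the spirit of NLTK or TextBlob. The toy word lists below are assumptions, not the libraries' actual lexicons; the point is that finance-specific hedging like "headwinds" simply isn't in a generic lexicon.

```python
# Toy lexicon baseline: count positive vs. negative words.
POSITIVE = {"growth", "record", "strong", "exceeded", "gain"}
NEGATIVE = {"decline", "loss", "weak", "missed", "drop"}

def lexicon_sentiment(text):
    words = text.lower().replace(",", " ").replace(".", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("Revenue growth exceeded expectations"))
# -> positive
print(lexicon_sentiment("We faced meaningful headwinds this quarter"))
# -> neutral: "headwinds" is not in the toy lexicon, so the tone is missed
```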

This dramatic gap highlights LLMs’ strength in interpreting filler words, indirect cues, and domain-specific phrasing common in earnings calls. Where traditional models require aggressive text cleaning that strips context, LLMs thrived on raw input, showing that richer, uncleaned input can yield better inference when a model is capable of processing it.
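The "stripped context" problem can be made concrete. A typical traditional pipeline removes stopwords and punctuation before scoring; the stopword list below is a toy assumption, but it shows how exactly the hedges that carry financial tone get deleted, while an LLM fed the raw sentence still sees them.

```python
import re

# Toy stopword list -- real pipelines (e.g. NLTK's) are larger, but
# commonly include hedging words like these.
STOPWORDS = {"we", "may", "might", "could", "somewhat", "not", "a", "of", "be"}

def aggressive_clean(text):
    """Lowercase, drop punctuation, and remove stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

hedged = "We might not see growth; margins could be somewhat soft."
print(aggressive_clean(hedged))
# -> "see growth margins soft"
# The cautionary hedges ("might", "not", "could", "somewhat") are gone,
# leaving a sentence that reads far more neutral than the original.
```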

However, the study also points out that no model exceeded 85% accuracy, suggesting that domain experts still outperform AI in high-stakes analysis. LLMs remain tools, not substitutes, for human judgment.

Can AI-derived sentiment predict stock market reactions?

The research team extended their analysis from static benchmarking into real-world financial forecasting. Using Microsoft’s quarterly earnings call transcripts, they segmented sentiment by business unit, ranging from Devices and Gaming to Cloud and Search, and compared tone to subsequent stock price movement.

The results were revealing. While overall sentiment did not correlate strongly with share price shifts, business-line sentiment did. For example, in Q3 2023, a spike in positive sentiment for Microsoft’s Devices segment preceded a significant stock gain. Conversely, positive tone in Search and News Advertising during Q1 2025 corresponded with a price drop, suggesting that investors may interpret optimism in certain segments with skepticism.
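The segment-level comparison can be sketched as a simple correlation between a per-quarter sentiment score for one business unit and the stock's subsequent move. All numbers below are made up for illustration; the study's actual scores and returns are not reproduced here.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-quarter sentiment for the Devices segment (scale -1..1)
devices_sentiment = [0.10, 0.35, 0.60, 0.20]
# Hypothetical stock return (%) in the week following each earnings call
next_week_return = [-0.5, 1.2, 2.8, 0.4]

print(f"Devices sentiment vs. return: r = {pearson(devices_sentiment, next_week_return):+.2f}")
```

Repeating this per business unit, rather than on the call as a whole, is what surfaces the segment-level patterns the study reports.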

These inversions, especially common in segments like Gaming and in the Q&A portion of calls, reinforce that tone alone is not predictive. Market expectations, prior performance, and broader trends all interact with sentiment to shape investor behavior.

Still, segment-level analysis proved far more valuable than broad-stroke interpretations. LLMs, especially ChatGPT-4o, allowed researchers to isolate these finer patterns, making the case for their use in exploratory analysis and early warning signal generation.

What are the limitations and optimization needs for LLMs in finance?

Despite their strong performance, LLMs in the study demonstrated several shortcomings. Chief among them was a lack of transparency - particularly in tools like Microsoft Copilot 365, which sometimes routed tasks to simpler sentiment engines like TextBlob without notifying users. This undermines trust, especially in finance, where clarity about method is as important as results.

LLMs also struggled with structured data like CSVs. The models often required pre-processing or plain-text conversion to avoid hallucination or formatting errors, adding friction to workflows.
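The pre-processing workaround mentioned above usually means flattening tabular rows into plain-text sentences before prompting. A minimal sketch, with hypothetical column names and values:

```python
import csv
import io

raw = """segment,quarter,revenue_growth_pct
Cloud,Q3 2023,21.0
Devices,Q3 2023,-9.0
"""

def csv_to_sentences(csv_text):
    """Render each CSV row as a natural-language sentence for an LLM prompt."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        f"In {r['quarter']}, the {r['segment']} segment's revenue "
        f"changed by {r['revenue_growth_pct']}%."
        for r in rows
    ]

for line in csv_to_sentences(raw):
    print(line)
```

Handing the model prose instead of delimiter-separated cells reduces the formatting errors the study observed, at the cost of an extra conversion step.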

The research also warns against overreliance on LLMs for strategic decisions. While these models supported code generation and visualizations, the key research questions and analytical interpretations came from the human team. AI, the authors emphasize, enhances rather than replaces human insight.

They call for improvements in three areas:

  • Performance transparency for fallback behavior;
  • More robust support for structured data formats;
  • Guardrails against hallucinations in fundamental NLP tasks.

FIRST PUBLISHED IN: Devdiscourse