
Explained: Apple’s new study questions the ‘reasoning’ in large reasoning models


Can your language model really ‘think’? A new research paper by Apple’s team suggests that the reasoning abilities of large reasoning models (LRMs) may have been significantly overstated, and that what looks like reasoning could in reality be advanced pattern matching.
Over the past few days, the results of this research have sparked intense discussion in both academic and technical circles. Some have also questioned the timing of the paper’s release, just days ahead of Apple’s flagship developer event, WWDC 2025. Media reports in May suggested the Cupertino-headquartered company is expected to open up its LLMs to outside app developers.
In the paper, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models”, the authors argue that while LRMs may demonstrate improved performance on accepted reasoning benchmarks, their fundamental capabilities and limitations remain insufficiently understood.

LRMs can be understood as machine learning models whose capabilities go beyond those of large language models (LLMs), thanks to their reasoning and problem-solving skills. Popular examples include OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Google Gemini Thinking. These ‘thinking’ models are characterised by an approach called Chain-of-Thought, which guides them to generate step-by-step reasoning before arriving at a final output.
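To illustrate the idea, a Chain-of-Thought prompt simply asks the model to lay out its intermediate steps before giving an answer. The Python sketch below is a minimal, hypothetical illustration; `call_model` is a stand-in for any LLM API and is not drawn from the Apple paper.

```python
# Minimal sketch of Chain-of-Thought prompting (illustrative only).
# `call_model` is a hypothetical placeholder for any LLM API call.

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its reply."""
    raise NotImplementedError("Connect this to the model of your choice.")

question = (
    "A train travels 120 km in 2 hours. "
    "How far does it travel in 5 hours at the same speed?"
)

# Direct prompting: ask for the answer alone.
direct_prompt = f"{question}\nAnswer with a single number."

# Chain-of-Thought prompting: ask for step-by-step reasoning first.
cot_prompt = (
    f"{question}\n"
    "Think through the problem step by step, showing each intermediate "
    "calculation, and only then state the final answer."
)

# 'Thinking' models such as the LRMs above effectively build this
# step-by-step stage into the model itself, producing a reasoning trace
# before the final output rather than relying on the prompt alone.
```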
Current evaluations of these models’ reasoning abilities focus primarily on established mathematical and coding benchmarks. However, the authors of the research argue that such evaluations do not offer insight into the structure and quality of the reasoning, and may even suffer from data contamination.
The experiment
The paper poses a few questions: Are these models truly reasoning or leveraging different forms of pattern matching? Does the performance scale with increasing complexity? What are the inherent limitations of current approaches, and what would help advance more robust reasoning capabilities?

Instead of relying on standard benchmarks like math problems, the study used controllable puzzle environments that allow for systematic variation in complexity by modifying puzzle elements while keeping the underlying logic intact.
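For instance, in a puzzle such as the Tower of Hanoi, one of the environments used in the study, the rules stay fixed while difficulty is dialled up simply by adding disks. The Python sketch below is an illustration of this idea of controllable complexity, not code from the paper.

```python
# Illustrative sketch: Tower of Hanoi as a controllable puzzle environment.
# The underlying logic never changes; only the number of disks (n) does,
# so the minimum solution length grows predictably as 2**n - 1.

def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Yield the optimal sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (source, target)  # move the largest remaining disk
    yield from hanoi_moves(n - 1, spare, target, source)

# Same rules, steadily harder instances: the kind of systematic scaling
# of complexity used to probe the models.
for n in range(3, 8):
    moves = list(hanoi_moves(n))
    print(f"{n} disks -> {len(moves)} moves (expected {2**n - 1})")
```

Because the optimal solution is known in advance for every instance, researchers can check not just a model’s final answer but each intermediate step it produces.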
They found that the accuracy of LRMs collapses beyond a certain level of complexity. “...their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget,” the researchers noted.
By comparing LRMs with standard LLMs under the same inference compute, the study identifies three regimes: standard models unexpectedly perform better on low-complexity tasks; LRMs have an edge on medium-complexity tasks, where extra reasoning is beneficial; and both types of models fail entirely on high-complexity tasks.

Gary Marcus, a prominent researcher and critic of AI hype, wrote on X, “It is that LRM ‘reasoning models’ can never be relied on to properly execute algorithms.”
“You can’t have reliable AGI without the reliable execution of algorithms.”
Path to AGI riddled with challenges
For many, the true north star of AI is achieving artificial general intelligence (AGI). AGI, which currently exists only in theory, is a type of AI that possesses human-level intelligence, meaning it can understand and perform any intellectual task that a human can. OpenAI, one of the most sought-after AI companies in the world, has said that its mission is to develop AGI, or AI systems that are smarter than humans.
Reasoning is considered a cornerstone of AGI because it reflects the ability to solve unfamiliar problems, make logical inferences, and adapt to new situations, much like a human would. Unlike narrow AI systems that excel at specific tasks, AGI requires robust reasoning to operate across a wide range of domains without task-specific tuning. True reasoning involves not just pattern recognition but understanding, planning, and abstract thinking, which is precisely what this study suggests current models lack.