
While Apple has fallen behind the curve in terms of the AI features the company has actually launched, its researchers continue to work at the cutting edge of what’s out there.
In a new paper, they take issue with claims being made about some of the latest AI models – that they are actually capable of step-by-step reasoning. Apple says its tests show that this simply isn’t true …
While it’s acknowledged that conventional generative AI models, aka Large Language Models (LLMs), have no ability to reason, some AI companies are claiming that a new generation of models can. These are being referred to as Large Reasoning Models (LRMs).
These grew out of attempts to have LLMs “show their work” – that is, lay out the individual steps taken to reach their conclusions. The idea is that if an AI can be forced to develop a chain of thought, taking things one step at a time, it will be less likely to make things up entirely or go off the rails at some point in its reasoning.
Some big claims are being made for this approach, but a new Apple research paper calls it “the illusion of thinking.” The researchers argue that testing a range of LRMs shows their “reasoning” quickly falls apart even on relatively simple logic challenges that are easy to solve algorithmically, like the Tower of Hanoi puzzle.
Tower of Hanoi is a puzzle featuring three pegs and n disks of different sizes stacked on the first peg in size order (largest at bottom). The goal is to transfer all disks from the first peg to the third peg. Valid moves include moving only one disk at a time, taking only the top disk from a peg, and never placing a larger disk on top of a smaller one.
You can create simpler or more complex versions of the game by varying the number of disks.
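To see why the puzzle is “easy to solve algorithmically,” here is a minimal sketch (not taken from the paper) of the standard recursive solution in Python; the function name and peg labels are purely illustrative. It also shows why complexity ramps up with disk count: the optimal solution requires 2^n − 1 moves, so each added disk roughly doubles the length of the required move sequence.

```python
# Minimal sketch: the standard recursive Tower of Hanoi solution,
# which produces the optimal sequence of 2**n - 1 moves.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of moves (disk, from_peg, to_peg) that transfers
    n disks from source to target, using spare as the intermediate peg."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move the n-1 smaller disks out of the way
    moves.append((n, source, target))            # move the largest disk directly
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 smaller disks back on top
    return moves

if __name__ == "__main__":
    for n in (3, 8):
        print(f"{n} disks -> {len(hanoi(n))} moves")  # 7 and 255 moves respectively
```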
What they found is that LRMs are actually worse than LLMs at the simplest versions of the puzzle, are slightly but not dramatically better when more disks are added – and then fail completely with more than eight disks.
Simple problems (N=1-3) show early accuracy declining over time (overthinking), moderate problems (N=4-7) show slight improvement in accuracy with continued reasoning, and complex problems (N≥8) exhibit consistently near-zero accuracy, indicating complete reasoning failure, meaning that the model fails to generate any correct solutions within the thought.
In fact, they demonstrated that LRMs fail even when given the algorithm needed to solve the puzzle! They say that these findings cast doubt on claims being made about the latest AI models.
These insights challenge prevailing assumptions about LRM capabilities […] Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds.
New York University professor emeritus of psychology and neural science Gary Marcus – who has long argued that LLMs are incapable of genuine reasoning – said the paper shows we need to move beyond the hope that building ever more capable LLMs will eventually result in intelligence.
Anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves. This does not mean that the field of neural networks is dead, or that deep learning is dead. LLMs are just one form of deep learning, and maybe others — especially those that play nicer with symbols – will eventually thrive. Time will tell. But this particular approach has limits that are clearer by the day.
Photo by BoliviaInteligente on Unsplash