Anthropic has slammed Apple’s AI tests as flawed, arguing that top-level reasoning models did not fail to reason – but were wrongly judged on formatting, output length, and impossible tasks. The real problem is bad benchmarks, it says.
AI research at loggerheads – Anthropic argues that recent tests claiming “reasoning collapse” in AI models actually reveal flaws in Apple’s evaluation methods – not the models’ reasoning capabilities.
Bad tests or bad faith – Apple treated reasoning models like text generators, penalizing them for hitting token limits or for formatting errors; some puzzles were even unsolvable, Anthropic argues.
AI is reasonable – Anthropic claims AI models reason correctly when allowed to output code or to flag impossible problems – arguing the issue lies in how AI is tested, and not in whether it can think.
Anthropic has hit back at Apple, accusing it of misreading the results of its own tests on the cognitive abilities of AI. The frontier AI outfit, directly implicated in recent Apple research about “accuracy collapse” in large reasoning models (LRMs), has issued a paper of its own in direct response, which says the reported failures are not signs of limits on AI reasoning, but of flawed experimental design, unrealistic expectations, and misinterpretation of the results.
The original paper from Apple was titled The Illusion of Thinking; Anthropic has now responded in kind, in a paper called The Illusion of the Illusion of Thinking. Touché.
It writes, in the conclusion of its new research: “[Apple’s] results demonstrate that models cannot output more tokens than their context limits allow, that programmatic evaluation can miss both model capabilities and puzzle impossibilities, and that solution length poorly predicts problem difficulty. These are valuable engineering insights, but they do not support claims about fundamental reasoning limitations.”
To recap, Apple researchers (Shojaee et al) sought to evaluate the reasoning abilities of the ‘thinking’ versions of the latest LRM-class models – notably Anthropic’s own Claude 3.7 Sonnet and DeepSeek’s R1/V3 systems, plus OpenAI’s o3-mini reasoning model. They posed certain sequence, logic, and puzzle problems – Tower of Hanoi, River Crossing, and Blocks World – and found the models failed on them.
These models over-complicate simple tasks and crash completely during complex ones, the authors concluded. They suggested the most advanced LRMs do not properly ‘reason’ at all; rather they pattern-match more like standard LLMs, and come unstuck when faced with problems that require multi-step planning that goes beyond examples memorised in training. In other words, they don’t think for themselves – ‘outside of the box’, as it were.
Anthropic’s beef is that Apple effectively tested LRMs as if they were standard LLMs – and then counted failures of text generation as failures of ‘reasoning’. In other words, it argues that Apple tested LRM-class models against LLM-style criteria focused on output fidelity, exhaustive step-by-step listings, and rigid formatting. The “accuracy collapse” during the Tower of Hanoi puzzle is simply down to models hitting their token (output) limits, it says.
The models fell foul of Apple’s own rigid output constraints, which forced them to cut off mid-calculation – rather than failing at the task itself, the argument goes. That is a practical engineering limit, not an abstract cognitive one. “A critical observation overlooked in the original study: models actively recognize when they approach output limits… This demonstrates that models understand the solution pattern but choose to truncate output due to practical constraints.”
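The arithmetic behind that claim is easy to see: the shortest Tower of Hanoi solution doubles with every extra disk, at 2^n - 1 moves for n disks. The short Python sketch below is an illustration only – not code from either paper – and the tokens-per-move figure and output budget are assumed round numbers; it simply shows how quickly a full move-by-move transcript outgrows a fixed completion limit.

```python
# Illustrative sketch (not from either paper): why exhaustive Tower of Hanoi
# move lists collide with output budgets. Assumes roughly 10 tokens per
# printed move and a hypothetical 64,000-token completion limit.

TOKENS_PER_MOVE = 10      # assumed average cost of printing one move
OUTPUT_BUDGET = 64_000    # assumed model output (completion) token limit

for n_disks in range(8, 16):
    min_moves = 2 ** n_disks - 1          # optimal solution length for n disks
    est_tokens = min_moves * TOKENS_PER_MOVE
    verdict = "fits" if est_tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{n_disks} disks: {min_moves:>6} moves ~= {est_tokens:>7} tokens -> {verdict}")
```

On those assumed figures, a complete transcript stops fitting somewhere around 13 disks, even if the model has fully grasped the solution pattern.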
The authors (Opus and Lawsen, from Anthropic and Open Philanthropy) add: “This mischaracterisation… as ‘reasoning collapse’ reflects an issue with automated evaluation systems that fail to account for model awareness and decision-making.” Worse, some of the River Crossing instances – carrying six or more missionary-and-cannibal pairs across a river in a boat that holds only three people, where cannibals can never outnumber missionaries – are mathematically unsolvable anyway.
There is no way to get everyone across. Faced with the problem, these LRMs effectively say as much. And yet Apple’s automated grader still scores them as failures, Anthropic argues – just as it marks otherwise logical solutions as wrong when they truncate the output or slip on formatting. “By automatically scoring these impossible instances as failures, the authors inadvertently demonstrate the hazards of purely programmatic evaluation.”
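Anthropic’s point about unsolvable instances can be checked mechanically. The sketch below is an independent illustration, not code from either paper: it runs a breadth-first search over the classic missionaries-and-cannibals simplification of the puzzle – six of each, a boat that holds three – and exhausts every reachable state without ever getting the whole group across.

```python
from collections import deque

# Illustrative check (not from either paper): breadth-first search over the
# missionaries-and-cannibals form of River Crossing, with six of each and a
# boat holding at most three. A state is (missionaries_left, cannibals_left,
# boat_on_left); a bank is safe if it has no missionaries or at least as
# many missionaries as cannibals.

N, BOAT = 6, 3

def safe(m, c):
    left_ok = (m == 0) or (m >= c)
    right_ok = (N - m == 0) or (N - m >= N - c)
    return left_ok and right_ok

def solvable():
    start, goal = (N, N, True), (0, 0, False)
    seen, queue = {start}, deque([start])
    while queue:
        m, c, left = queue.popleft()
        if (m, c, left) == goal:
            return True
        for dm in range(BOAT + 1):
            for dc in range(BOAT + 1 - dm):
                if dm + dc == 0:
                    continue
                # Ferry dm missionaries and dc cannibals to the other bank
                nm = m - dm if left else m + dm
                nc = c - dc if left else c + dc
                if 0 <= nm <= N and 0 <= nc <= N and safe(nm, nc):
                    nxt = (nm, nc, not left)
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return False

print("solvable" if solvable() else "no solution exists")  # prints: no solution exists
```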
Anthropic’s response goes on: “Models receive zero scores not for reasoning failures, but for correctly recognizing unsolvable problems – equivalent to penalizing a SAT solver for returning ‘unsatisfiable’ on an unsatisfiable formula.” Anthropic’s paper says that, when asked to generate functions rather than turn-by-turn directions, LRMs performed with “high accuracy” – even on puzzles Apple said were total failures.
Apple’s original tests asked LRMs to enumerate every move, which exhausted their token limits and led to incomplete outputs. When they were instructed to output code functions instead, they solved the puzzles. The paper states: “When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on… instances previously reported as complete failures.”
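For a sense of what “requesting generating functions” looks like in practice, the snippet below is an illustrative example rather than output from any of the tested models: a compact recursive Tower of Hanoi solver of the kind a model can emit in a few lines, which produces all 2^n - 1 moves without having to write each one out by hand.

```python
# Illustrative example (not model output): a compact Tower of Hanoi solver of
# the kind a reasoning model can emit instead of listing every move verbatim.
# It yields the full optimal sequence of 2**n - 1 moves for any n.

def hanoi(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves solving Tower of Hanoi for n disks."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # park the smaller stack
    yield (n, source, target)                        # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # re-stack on top of it

moves = list(hanoi(10))
print(len(moves))    # 1023, i.e. 2**10 - 1
print(moves[:3])     # [(1, 'A', 'B'), (2, 'A', 'C'), (1, 'B', 'C')]
```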
Anthropic concludes: “The question isn’t whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing.” That is its critique in a nutshell, aimed at the whole AI research community: academic tests that evaluate reasoning in frontier models should be fit for purpose – appraising the models’ logic, not just whether they can type out every step under rigid output demands.
All of which makes for an interesting stand-off between the LRM leader and the LRM laggard. It also feeds into a separate discussion about why this matters more broadly, which RCR Wireless has covered as a polemical think-piece here, and in a more balanced account here.