Anthropic* fires back – AI reasoning works, Apple’s reasoning doesn’t

NOTE (*): This article has been edited to reflect that the paper, The Illusion of the Illusion of Thinking, was wrongly attributed to Anthropic, the company, as the lead author. In fact, the lead author is listed as ‘C. Opus, Anthropic’, a reference to Claude Opus, Anthropic’s most advanced reasoning model (LRM). The other author is Alex Lawsen, a senior program associate in AI governance and policy at Open Philanthropy (‘A. Lawsen, Open Philanthropy’). Open Philanthropy is a US-based “philanthropic advising and funding organization”. As such, the response to Apple, contained in the paper, is presumably AI-reasoned and human-edited. Lawsen has since written an explanation of sorts, which frames the exercise as a joke. Certainly, this changes the narrative, even if the arguments it makes are the same as originally reported (now edited) below.

Apple’s debunking of high-end AI has been debunked – in a paper apparently delivered by US consultancy Open Philanthropy, and reasoned to an extent by AI itself. Specifically, the report, authored by ‘Anthropic’ and Open Philanthropy, argues that top-level reasoning models did not fail to reason, as Apple claimed last week – but were wrongly judged on formatting, output length, and impossible tasks. The real problem is bad benchmarks, the new report says.

AI at loggerheads – ‘Anthropic’ and Open Philanthropy argue that recent tests claiming “reasoning collapse” in AI models actually reveal flaws in Apple’s evaluation methods – not the models’ reasoning capabilities.

Bad tests or bad faith – Apple treated reasoning models like text generators, penalizing them for hitting token limits or for formatting errors; some puzzles were even unsolvable, ‘Anthropic’ and Open Philanthropy argue.

AI is reasonable – ‘Anthropic’ and Open Philanthropy claim AI models reason correctly when allowed to output code or identify impossible problems – proving the issue lies in how AI is tested, and not whether it can think.

‘Anthropic’ has hit back at Apple for failing to properly understand the results of its own tests on the cognitive abilities of AI. The frontier AI outfit, directly implicated in recent Apple research about “accuracy collapse” in large reasoning models (LRMs), is named as the lead author of a new paper, which says the reported failures are not signs of AI reasoning limits, but of flawed experimental design, unrealistic expectations, and a misinterpretation of the results.

The original paper from Apple was titled The Illusion of Thinking; the new paper is called The Illusion of the Illusion of Thinking. Touché. In a twist, missed by RCR in the original version of this article, the authorship of the response paper is cited as ‘C. Opus, Anthropic’ and ‘A. Lawsen, Open Philanthropy’; the first, positioned as lead author, is a reference to Claude Opus, Anthropic’s LRM system, which was tested by Apple in its investigation – which suggests the ‘Anthropic’ response is, at least in part, reasoned by AI, or else that the whole exercise is a joke.

Alex Lawsen, senior program associate in AI governance and policy at Open Philanthropy, has at least stewarded the research and production. The twist changes the broader debate, although the arguments presented in the new paper are the same – as is the reporting of them, effectively.

Claude and Lawsen write, in the conclusion of the new research: “[Apple’s] results demonstrate that models cannot output more tokens than their context limits allow, that programmatic evaluation can miss both model capabilities and puzzle impossibilities, and that solution length poorly predicts problem difficulty. These are valuable engineering insights, but they do not support claims about fundamental reasoning limitations.”

To recap, Apple researchers (Shojaee et al) sought to evaluate the reasoning abilities of the ‘thinking’ versions of the latest LRM-class models – notably Anthropic’s Claude 3.7 Sonnet and DeepSeek’s R1/V3 systems, plus OpenAI’s o3-mini model, from its ‘o’ series of reasoning systems. They proposed certain sequence, logic, and puzzle problems – called Tower of Hanoi, River Crossing, Blocks World – and found the models to fail as the puzzles scaled up in complexity.

These models over-complicate simple tasks and crash completely during complex ones, the authors of the Apple paper concluded. They suggested the most advanced LRMs do not properly ‘reason’ at all; rather they pattern-match more like standard LLMs, and come unstuck when faced with problems that require multi-step planning that goes beyond examples memorised in training. In other words, they don’t think for themselves – ‘outside of the box’, as it were.

Anthropic’s beef (as it is presented) is that Apple tested LRMs as if they were standard LLMs, effectively – and then blamed them for failing to generate text efficiently, rather than for failing to ‘reason’. In other words, the new paper argues that Apple tested LRM-class models against LLM-style criteria focused on output fidelity, step-by-step completeness, and rigid formatting. The “accuracy collapse” during the Tower of Hanoi puzzle is just down to models hitting their token (output) limits, it says.

The models fell foul of Apple’s rigid output constraints, which forced them to cut off mid-calculation, rather than failing the task itself – the argument goes. It is a practical engineering failure, rather than an abstract cognitive one. “A critical observation overlooked in the original study: models actively recognize when they approach output limits… This demonstrates that models understand the solution pattern but choose to truncate output due to practical constraints.”
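
The arithmetic behind that claim is easy to sanity-check. A complete Tower of Hanoi solution for n disks requires 2^n - 1 moves, so the cost of writing every move out grows exponentially. The short sketch below uses an assumed tokens-per-move figure and an assumed output cap (neither number comes from either paper) to show roughly where an exhaustive move list stops fitting.

```python
# Back-of-the-envelope check of the token-budget argument for Tower of Hanoi.
# The 2**n - 1 move count is exact; the tokens-per-move figure and the output
# cap are illustrative assumptions, not numbers taken from either paper.

TOKENS_PER_MOVE = 10      # assumed cost of printing one move, e.g. "move disk 3 from A to C"
OUTPUT_CAP = 64_000       # assumed output/token limit of the model being tested

for n in range(5, 21):
    moves = 2 ** n - 1                       # minimum moves for an n-disk Tower of Hanoi
    tokens_needed = moves * TOKENS_PER_MOVE  # rough cost of enumerating every move
    fits = "fits" if tokens_needed <= OUTPUT_CAP else "exceeds cap"
    print(f"n={n:2d}  moves={moves:7d}  ~tokens={tokens_needed:8d}  {fits}")
```

With those particular assumptions, the full enumeration blows past the cap at around 13 disks; the exact crossover depends entirely on the numbers chosen, but the exponential shape of the curve does not.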

Claude and Lawsen add: “This mischaracterisation… as ‘reasoning collapse’ reflects an issue with automated evaluation systems that fail to account for model awareness and decision-making.” Worse, the whole premise of the River Crossing puzzle at the larger sizes Apple tested – ferrying six or more missionary-and-cannibal pairs across a river in a boat that holds three, with cannibals never outnumbering missionaries – is impossible, anyway.

There is no way to get everyone across. Faced with the problem, these LRMs say as much, effectively. And yet Apple penalises them for their workings – argue Claude and Lawsen. Apple’s grading system marks logical solutions as wrong if they miss parts of the output or mess up parts of the formatting. “By automatically scoring these impossible instances as failures, the authors inadvertently demonstrate the hazards of purely programmatic evaluation.”
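
That impossibility claim is the kind of thing a short search over the puzzle’s state space can verify. The sketch below is a minimal illustration using the missionaries-and-cannibals framing described above (the rebuttal’s actual puzzle pairs actors with agents, a slightly different constraint), and simply reports whether any sequence of crossings gets everyone to the far bank; the function name and structure are ours, not the paper’s.

```python
# Minimal sketch: breadth-first search over the river-crossing state space to
# check solvability. It uses the "cannibals may never outnumber missionaries"
# bank constraint as described above; treat it as an illustration of the kind
# of impossibility check the rebuttal describes, not a reproduction of it.
from collections import deque

def solvable(pairs: int, boat: int) -> bool:
    def valid(m, c):
        # On each bank, missionaries present must not be outnumbered.
        left_ok = m == 0 or m >= c
        right_ok = (pairs - m) == 0 or (pairs - m) >= (pairs - c)
        return left_ok and right_ok

    # State: (missionaries on left, cannibals on left, boat side: 0=left, 1=right).
    start, goal = (pairs, pairs, 0), (0, 0, 1)
    seen, queue = {start}, deque([start])
    while queue:
        m, c, side = queue.popleft()
        if (m, c, side) == goal:
            return True
        # People available on the bank the boat is currently on.
        avail_m, avail_c = (m, c) if side == 0 else (pairs - m, pairs - c)
        for dm in range(avail_m + 1):
            for dc in range(avail_c + 1):
                if not (1 <= dm + dc <= boat):
                    continue
                nm, nc = (m - dm, c - dc) if side == 0 else (m + dm, c + dc)
                nxt = (nm, nc, 1 - side)
                if valid(nm, nc) and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

print(solvable(3, 2))   # True  -- the classic three-pair, two-seat puzzle
print(solvable(6, 3))   # False -- no sequence of crossings exists
```

Run as written, the search finds the classic three-pair puzzle solvable and the six-pair, three-seat version unsolvable, which is the kind of result Claude and Lawsen say Apple’s automated grader should have recognised rather than scored as a failure.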

Their response goes on: “Models receive zero scores not for reasoning failures, but for correctly recognizing unsolvable problems – equivalent to penalizing a SAT solver for returning ‘unsatisfiable’ on an unsatisfiable formula.” Their paper says that, when asked to generate functions rather than turn-by-turn directions, LRMs performed with “high accuracy” – even on puzzles Apple said were total failures. 

Apple’s original tests asked LRMs to enumerate every move, which exhausted their token limits and led to incomplete outputs. When the rebuttal instead instructed them to output code functions, they solved the puzzles. It writes: “When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on… instances previously reported as complete failures.”
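
To make the contrast concrete: a generating function for Tower of Hanoi is only a few lines long, even though the move list it produces doubles with every extra disk. The sketch below is written in Python purely for illustration; it is not a reproduction of any model’s actual output, and the rebuttal does not prescribe a particular language.

```python
# The contrast the rebuttal draws: the full move list for n disks is 2**n - 1
# lines long, but the generating function that produces it is a handful of lines.

def hanoi(n, src="A", dst="C", aux="B"):
    """Yield every move needed to shift n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, aux, dst)   # clear the way for the largest disk
    yield (n, src, dst)                      # move the largest disk
    yield from hanoi(n - 1, aux, dst, src)   # restack the rest on top of it

# The function itself is a complete 'solution' that fits in a few dozen tokens,
# even when printing the moves it generates would exceed any output cap.
moves = list(hanoi(10))
print(len(moves))        # 1023 == 2**10 - 1
print(moves[:3])         # first few moves (smallest disks moving first)
```

A model that emits the function has, in effect, stated the whole solution compactly; only the act of typing out every individual move runs into the output ceiling.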

The new paper concludes: “The question isn’t whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing.” That is its entire critique, for the whole AI research community: tests to evaluate reasoning in frontier models should be appropriate to the task, and should appraise their logic, not just whether they can type out every step, especially when constrained by rigid output demands.

All of which makes for an interesting stand-off – between the LRM leader and the LRM laggard. It also feeds into a separate discussion about why this matters more broadly, which RCR Wireless has covered in a polemical think-piece here, and in a more balanced account here.

ABOUT AUTHOR

James Blackman
James Blackman has been writing about the technology and telecoms sectors for over a decade. He has edited and contributed to a number of European news outlets and trade titles. He has also worked at telecoms company Huawei, leading media activity for its devices business in Western Europe. He is based in London.