
Thinking about ‘the illusion of thinking’ – why Apple has a point (Reader Forum)

In the past few days, Apple’s provocatively titled paper, The Illusion of Thinking, has sparked fresh debate in AI circles. The claim is stark: today’s language models don’t really “reason”. Instead, they simulate the appearance of reasoning until complexity reveals the cracks in their logic. Not surprisingly, the paper has triggered a rebuttal, entitled The Illusion of the Illusion of Thinking, credited to “C. Opus” – a nod to Anthropic’s Claude Opus model – and Alex Lawsen, who apparently published the commentary on the arXiv distribution service as a joke. The joke got out of hand and the response has been widely circulated. Joke or not – does the LLM actually debunk Apple’s thesis? Not quite.

What Apple shows

Maria Sukhareva – models do not rise to the challenge

The Apple team set out to probe whether AI models can truly reason – or whether they’re just mimicking problem-solving based on memorized examples. To do this, the team designed tasks where complexity could be scaled in controlled increments: more disks in the Tower of Hanoi, more checkers in Checker Jumping, more characters in River Crossing, more blocks in Blocks World.
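Apple’s actual test harness is not reproduced here, but a rough sketch of what “scaled in controlled increments” means for one of these puzzles is easy to give: each extra disk in the Tower of Hanoi doubles the length of the optimal solution, so the same recursive principle has to be sustained over an exponentially longer output.

```python
# Illustrative sketch only (not Apple's code): for Tower of Hanoi, the optimal
# solution has 2**n - 1 moves, so raising the disk count one step at a time
# doubles the amount of correct output the model must produce.
for n_disks in range(3, 13):
    print(f"{n_disks} disks -> {2 ** n_disks - 1} moves in the optimal solution")
```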

The assumption is simple: if a model has mastered reasoning in simpler cases, it should be able to extend those same principles to more complex ones – especially when ample compute and context length remain available. But that’s not what happens. The Apple paper finds that even when operating well within their token budgets and inference capabilities, models do not rise to the challenge. 

Instead, they generate shorter, less structured outputs as complexity increases. This suggests a kind of “giving up,” not a struggle against hard constraints. Even more telling, the paper finds that models often reduce their reasoning effort just when more effort is needed. As further evidence, Apple references 2024 and 2025 benchmark questions from the American Invitational Mathematics Examination (AIME), a prestigious US mathematics competition for top-performing high-school students. 

While human performance improves year-on-year, model scores decline on the unseen 2025 batch – supporting the idea that AI success is still heavily reliant on memorized patterns, and not flexible problem-solving.

Where Claude fails

The counterargument hinges on the idea that language models truncate responses not because they fail to reason, but because they “know” the output is becoming too long. One cited example shows a model halting mid-solution with a self-aware comment: “The pattern continues, but to avoid making this too long, I’ll stop here.”

This is presented as evidence that models understand the task but choose brevity. 

But it is anecdotal at best – drawn from a single social media post – and makes a large inferential leap. Even the engineer who originally posted the example doesn’t fully endorse the rebuttal’s conclusion. They point out that higher generation randomness (“temperature”) leads to accumulated errors, especially on longer sequences – so stopping early may not indicate understanding, but entropy avoidance.

The rebuttal also invokes a probabilistic framing: that every move in a solution is like a coin flip, and eventually even a small per-token error rate will derail a long sequence. But reasoning isn’t just probabilistic generation; it’s pattern recognition and abstraction. Once a model identifies a solution structure, later steps should not be independent guesses – they should be deduced. The rebuttal doesn’t account for this.
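For what it is worth, the coin-flip arithmetic the rebuttal leans on is easy to write down. In the sketch below the per-step error rate is a placeholder, not a measured figure; the point is only that chains of genuinely independent steps decay fast, which is exactly why it matters whether the steps are independent guesses or deductions.

```python
# The rebuttal's framing: each of n steps independently succeeds with
# probability (1 - p), so an unaided chain survives with probability (1 - p)**n.
# p = 0.01 is a placeholder, not a measured per-token error rate.
p = 0.01
for n in (10, 100, 1000):
    print(f"{n} steps -> success probability {(1 - p) ** n:.3f}")
```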

The real miss for the rebuttal, though, is its argument that models can succeed if prompted to generate code. This misses the whole point. Apple’s goal was not to test whether models could retrieve canned algorithms; it was to evaluate their ability to reason through the structure of the problem on their own. If a model solves a problem by simply recognizing it should call or generate a specific tool or piece of code, then it is not really reasoning – it is just recalling a solution or a pattern.

In other words, if an AI model sees the Tower of Hanoi puzzle and responds by outputting Lua code it has ‘seen’ before, it is just matching the problem to a known template and retrieving the corresponding tool. It is not ‘thinking’ through the problem; it’s just sophisticated library search.
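To make that concrete, here is the kind of “seen before” template in question – the textbook recursive Tower of Hanoi routine, sketched here in Python rather than Lua. Emitting it verbatim requires recognizing the puzzle, not working through the move sequence.

```python
# The canonical recursive Tower of Hanoi solution, as found in countless
# tutorials. A model that reproduces this is matching the puzzle to a known
# template, not reasoning about the moves themselves.
def solve_hanoi(n, source="A", spare="B", target="C"):
    if n == 0:
        return
    solve_hanoi(n - 1, source, target, spare)    # move the top n-1 disks out of the way
    print(f"move disk {n} from {source} to {target}")
    solve_hanoi(n - 1, spare, source, target)    # stack them back onto the largest disk

solve_hanoi(3)   # prints the seven moves of the three-disk instance
```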

Where this leaves us

To be clear, the Apple paper is not bulletproof. Its treatment of the River Crossing puzzle is a weak point. Once enough people are added to the puzzle, the problem becomes unsolvable. And yet Apple’s benchmark marks a “no solution” response as wrong. That is an error. But the thing is, the model’s performance has already collapsed before the problem becomes unsolvable – which suggests the drop-off happens not at the edge of reason, but long before it.
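That unsolvability claim is also something a grader can check mechanically. The sketch below is a minimal brute-force solvability check, not the benchmark’s own scorer, and it simplifies the rules: safety is checked only on the two banks, and the boat simply needs at least one occupant.

```python
from collections import deque
from itertools import combinations

def safe(group):
    """No actor may share a bank with another pair's agent unless their own agent is present."""
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    return all(i in agents or not (agents - {i}) for i in actors)

def solvable(n_pairs, boat_capacity):
    """Breadth-first search over River Crossing states: True if any legal crossing sequence exists."""
    people = frozenset({("actor", i) for i in range(n_pairs)} |
                       {("agent", i) for i in range(n_pairs)})
    start = (people, "left")                       # everyone, and the boat, starts on the left bank
    seen, queue = {start}, deque([start])
    while queue:
        left, side = queue.popleft()
        if not left:                               # everyone has reached the right bank
            return True
        bank = left if side == "left" else people - left
        for size in range(1, boat_capacity + 1):   # the boat needs at least one occupant
            for movers in combinations(bank, size):
                movers = frozenset(movers)
                new_left = left - movers if side == "left" else left | movers
                state = (new_left, "right" if side == "left" else "left")
                if state not in seen and safe(new_left) and safe(people - new_left):
                    seen.add(state)
                    queue.append(state)
    return False

print(solvable(3, 2))   # the classic three-pair, two-seat instance is solvable: True
```

A check like this is all a benchmark needs in order to award “no solution exists” as the correct answer when the search comes up empty.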

In conclusion, the rebuttal, whether AI-assisted or AI-generated, raises important questions, especially around evaluation methods and model self-awareness. But it rests more on anecdote and hypothetical framing than on rigorous counter-evidence. Apple’s original claim – that current models simulate reasoning without scaling it – remains largely intact. And it is not actually new; data scientists have been saying this for a long time.

But it always helps, of course, when big companies like Apple support the prevailing science. Apple’s paper may sound confrontational at times – in the title alone. But its analysis is thoughtful and well-supported. What it reveals is a truth the AI community must grapple with: reasoning is more than token generation, and without deeper architectural shifts, today’s models may remain trapped in this illusion of thinking.

Maria Sukhareva has been working in the field of AI for 15 years – in AI model training and product management. She is principal key expert in AI at Siemens. The views expressed above are hers, and not her employer’s. Her Substack blog page is here; her website is here.
