This whole time, you’ve been your phone’s operating system
Here’s the uncomfortable truth about smartphones: For the past fifteen years, you’ve been doing the computer’s job.
You’re texting about dinner plans. Someone mentions a restaurant. You copy the address, switch to Maps, paste it, check the route, switch to Calendar, create an event, switch back to confirm. Five app switches for a simple task. You are the router. You are the integration layer. You manually transfer context between applications because the phone can’t.
The average user switches between apps 352 times per day. Each switch requires remembering where information lives, manually transferring context, and reorienting to a different interface. We became digital janitors, sweeping information from one app to another.
AI phones promise to change this fundamental arrangement. Not by adding features, but by changing who does the work. The shift is from “app-centric” computing, where you orchestrate applications, to “agent-centric” computing, where AI orchestrates applications on your behalf.
The question is whether this shift is real, and if so, where the economic value gets captured.
Three capabilities that actually matter
I’ve spent the past six months investigating Samsung Galaxy AI, Apple Intelligence, and the offerings of the Chinese manufacturers. There is a lot of noise, but three strong signals suggest real traction:
Cross-app orchestration: ByteDance’s Doubao assistant on the ZTE Nubia M153 demonstrates the end state. A user asks: “Compare the price of this hair dryer across JD.com, Taobao, and Pinduoduo.” The agent opens each e-commerce app, searches for the product, extracts prices, and presents a comparison. No pre-programmed integration. The AI reads the screen and clicks buttons like a human would. This GUI-as-API approach means any app becomes agent-compatible without modification.
Honor’s “Magic Portal” shows the intermediate step: copy an address and the system offers to open navigation or ride-hailing. Copy a phone number and it offers to call or message. This predictive intent resolution eliminates the manual workflow.
Digital memory without filing: You take hundreds of photos and receive thousands of messages every month. Traditional phones make you the librarian. AI phones build persistent memory automatically through on-device vector databases that index every interaction (screenshots, messages, locations, etc.).
Ask “Where did I park?” and the phone finds the photo you took three hours ago without any tags or organization. Ask “What was that pickup code?” and it parses SMS history automatically. Samsung’s Personal Data Engine and Apple’s similar on-device processing create this personal knowledge graph locally, enabling personalization without privacy invasion.
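To make “memory without filing” concrete, here is a minimal sketch of the retrieval pattern: index snippets of phone activity as vectors, then answer questions by nearest-neighbour search. The hashed bag-of-words embedding and the sample data are toy stand-ins invented for illustration; shipping systems use learned embedding models and a real on-device vector store.

```python
# Toy on-device "digital memory": index snippets of phone activity
# (OCR'd screenshots, SMS, photo captions) as vectors, then answer
# natural-language questions by nearest-neighbour search.
import math
from collections import Counter

DIM = 4096  # hashed bag-of-words dimension (toy stand-in for a learned embedding)

def embed(text: str) -> list[float]:
    """Map text to a fixed-size, L2-normalized vector."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Interactions indexed automatically as they happen -- no manual filing.
memory = [
    ("photo_1042.jpg", "photo of car parked at level 3 section B of the garage"),
    ("sms_8841",       "your pickup code is 59731 for locker 12"),
    ("screenshot_77",  "reservation confirmed at lucia trattoria friday 7pm"),
]
index = [(item_id, text, embed(text)) for item_id, text in memory]

def recall(query: str, k: int = 1) -> list[tuple[str, str]]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(item_id, text) for item_id, text, _ in ranked[:k]]

print(recall("where is my car parked"))      # -> the parking photo
print(recall("what was the pickup code"))    # -> the SMS with the code
```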
Multi-step autonomous execution: Traditional voice assistants mapped rigid commands to specific actions. Modern agents reason through complex tasks. “Book a table at that place I liked on social media” becomes: search recent posts for restaurants, identify likely candidate, check reservations, find available times, execute booking.
The Nubia M153’s dedicated AI Button enables this. Early adopters report 70% success rates for well-structured tasks. It is imperfect but functional enough to be genuinely useful.
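The underlying pattern is plan-then-execute: decompose the goal into tool calls, run them in order, and stop (or re-plan) on failure. The sketch below stubs everything out; llm_plan() and the tool functions are hypothetical placeholders, not any vendor’s API.

```python
# Minimal plan-then-execute agent loop behind tasks like
# "book a table at that place I liked on social media".
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict

def llm_plan(goal: str) -> list[Step]:
    """Stand-in for the model decomposing a goal into tool calls."""
    return [
        Step("search_social_history", {"query": "restaurant I liked"}),
        Step("find_reservation_slots", {"restaurant": "<from previous step>"}),
        Step("book_table", {"time": "<user-confirmed slot>"}),
    ]

# Stand-ins for real app integrations.
TOOLS = {
    "search_social_history": lambda args: {"restaurant": "Lucia Trattoria"},
    "find_reservation_slots": lambda args: {"slots": ["19:00", "20:30"]},
    "book_table":             lambda args: {"status": "confirmed"},
}

def run_agent(goal: str) -> list[dict]:
    results = []
    for step in llm_plan(goal):
        try:
            results.append(TOOLS[step.tool](step.args))
        except Exception as exc:
            # Real agents re-plan or ask the user here; a meaningful share of
            # complex tasks still fails mid-sequence, hence confirmation prompts.
            results.append({"error": str(exc), "failed_step": step.tool})
            break
    return results

print(run_agent("Book a table at that place I liked on social media"))
```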
Where the money gets made
The smartphone market is projected to reach $579 billion by 2026 despite flat unit shipments. This growth comes entirely from AI-capable flagships driving higher prices. But hardware premiums are just the beginning.
Three revenue streams are emerging:
Hardware premiums (immediate): China’s premium segment ($600+) captured 28% of market share in 2024, up from 11% in 2018. This is a dramatic shift toward premiumization according to Counterpoint Research. In Q1 2025, China’s overall smartphone market grew just 3.3% year-over-year, but premium devices continued driving disproportionate value capture. This validates that AI capabilities justify hardware premiums, but it’s one-time capture with diminishing returns as capabilities commoditize.
Subscriptions (2026 rollout): Samsung’s “free until end of 2025” offer is market testing. The expected tiered model:
- Basic tier (free): On-device features, including translations, photo edits, and local summaries
- Pro tier ($10-15/month): Cloud-intensive features, including advanced reasoning and unlimited queries
- Enterprise tier ($20+/month): Enhanced privacy, priority processing, and business workflows
Apple is rumored to be preparing “Apple Intelligence+”, bundled with Apple One. The question: Will users pay for AI as a service, or do they expect it bundled into hardware? History suggests skepticism. Mobile consumers have repeatedly rejected software subscriptions. But if AI becomes genuinely essential, this time might be different.
Agentic commerce fees (the real prize): When users tell their agent “buy the best running shoes for flat feet under $100” instead of browsing Amazon, the agent executes transactions directly. The provider captures:
- Transaction commissions (2-5% of purchase value)
- Sponsored placement fees (brands bidding for recommendations)
- Priority processing for partner merchants
Google’s Agent Payments Protocol (AP2) is building infrastructure for this today. ByteDance, with 4 billion monthly users, could generate billions annually from 1-2% fees on agent-orchestrated commerce. This dividend dwarfs hardware margins.
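A back-of-the-envelope check shows why a 1-2% take rate matters at this scale. Every input below except the user count cited above is an assumption made purely for illustration.

```python
# Illustrative arithmetic for agent-orchestrated commerce fees.
# All inputs marked "assumed" are invented for illustration, not reported figures.
monthly_users     = 4_000_000_000   # claimed Doubao/ByteDance reach (from the text)
share_using_agent = 0.05            # assumed: 5% of users transact via the agent
monthly_spend_usd = 30              # assumed: average agent-routed spend per user
take_rate         = 0.015           # 1.5%, midpoint of the cited 1-2% range

annual_fee_revenue = (monthly_users * share_using_agent *
                      monthly_spend_usd * take_rate * 12)
print(f"${annual_fee_revenue / 1e9:.1f}B per year")  # ≈ $1.1B under these assumptions
```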
The critical insight: Traditional app store commissions (30% of purchases) are threatened as agents disintermediate apps. The new model captures value at the transaction layer rather than app download layer. This is the battle that determines who owns mobile commerce in five years.
What changes fundamentally
The real value AI phones create is about removing friction from existing tasks to the point where they become practical rather than aspirational. But more fundamentally, it’s about shifting from applications as the unit of value delivery to agents as the primary interface.
Cross-app intelligence that actually works: Traditional smartphones sandbox applications for security, isolating data and functionality. AI agents bridge these silos through system-level permissions and intent frameworks.
Apple’s “App Intents” and Huawei’s “Intelligent Agent Framework” allow the OS to expose app functionalities as discrete actions the AI can invoke. More aggressive implementations use multimodal vision models to literally see the screen. Essentially, the agent observes the GUI and simulates human clicks and swipes.
This creates unprecedented interoperability. ByteDance’s Doubao demonstrates this vividly: a user can say “Compare this hair dryer’s price across JD.com, Taobao, and Pinduoduo” and the agent navigates each e-commerce app autonomously, extracting prices and presenting a comparison table. The agent treats the visual interface as its API.
The technical breakthrough is that you don’t need every app developer to build specific integrations. The agent can work with any app by reading the screen and manipulating the interface, just as a human would. This GUI-as-API approach means legacy apps immediately become agent-compatible without modification.
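In code, the GUI-as-API loop is roughly: screenshot the phone, ask a multimodal model for the next action, inject the tap or keystroke, repeat. This is a hypothetical sketch; screenshot(), vision_model(), and inject() stand in for the OS screen capture, the model, and the platform’s accessibility/input-injection APIs.

```python
# Sketch of a vision-based "GUI-as-API" agent loop.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "tap", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def screenshot() -> bytes:
    return b"<raw pixels>"                        # stand-in for a real screen capture

def vision_model(task: str, screen: bytes, history: list[Action]) -> Action:
    """Stand-in for a multimodal model that reads the screen and picks an action."""
    script = [Action("tap", 120, 980),            # open the search box
              Action("type", text="hair dryer"),
              Action("tap", 300, 410),            # open the first result
              Action("done")]
    return script[min(len(history), len(script) - 1)]

def inject(action: Action) -> None:
    print(f"executing {action}")                  # stand-in for OS input injection

def run_gui_agent(task: str, max_steps: int = 10) -> None:
    history: list[Action] = []
    for _ in range(max_steps):
        action = vision_model(task, screenshot(), history)
        if action.kind == "done":
            return
        inject(action)
        history.append(action)

run_gui_agent("Find this hair dryer's price on JD.com")
```

Because the agent only ever sees pixels and emits taps, the same loop works against any app, which is exactly why no per-app integration is needed and why reliability, not coverage, is the limiting factor.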
Persistent context without explicit input: Previous phone features required you to tell them what you wanted. AI phones infer from context through digital memory: effectively, on-device vector databases that index every interaction.
The Nubia M153, powered by ByteDance’s Doubao, maintains a personal knowledge graph using system-level permissions to read screen content in real-time. Users report asking “Where did I park?” and having the phone retrieve a parking spot photo taken hours earlier, or “What is the pickup code?” and having it parse SMS messages automatically.
This is a fundamental shift in the interaction model. You’re no longer searching databases of organized information. You’re querying a semantic layer over your chaotic digital life, and the AI finds relevant context even when you don’t remember where you stored it or whether you stored it at all.
Samsung’s implementation is more conservative but demonstrates similar value. Their Personal Data Engine analyzes user data on-device to enable natural language photo search (“show me beach photos from the trip with Vish”) without sending anything to the cloud. The photos aren’t tagged or organized; the AI understands image content and metadata relationships automatically.
Generative User Interfaces: The reliance on static, pre-compiled interfaces is diminishing. AI phones employ Generative UI (GenUI), where the interface constructs dynamically based on conversation and intent.
Instead of a static list of search results, an agent might generate a comparison table, an interactive widget, or a personalized dashboard on the fly. Google’s Gemini and Flutter’s GenUI SDK allow the LLM to describe a UI layout that renders natively. The interface becomes fluid rather than rigid.
Xiaomi’s HyperOS 2.0 introduces “AI Magic Painting” and dynamic lock screens that evolve based on user preferences. The interface adapts to the task rather than forcing tasks into predefined UI templates. This maximizes information density and relevance. You get exactly the interface needed for your current goal, not a generic template that serves all goals adequately but none perfectly.
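A minimal sketch of the GenUI pattern: the model emits a declarative layout description, and the client maps it onto native components. The JSON schema and render functions here are invented for illustration, not the actual Gemini or Flutter GenUI interfaces.

```python
# Generative UI sketch: model describes the layout, client renders it natively.
import json

# Stand-in for what the model returns instead of prose.
llm_output = json.dumps({
    "type": "comparison_table",
    "title": "Hair dryer prices",
    "columns": ["Store", "Price", "Delivery"],
    "rows": [["JD.com", "$39", "tomorrow"],
             ["Taobao", "$35", "3 days"],
             ["Pinduoduo", "$33", "5 days"]],
})

def render_table(spec: dict) -> str:
    header = " | ".join(spec["columns"])
    body = "\n".join(" | ".join(row) for row in spec["rows"])
    return f"{spec['title']}\n{header}\n{body}"

RENDERERS = {"comparison_table": render_table}

spec = json.loads(llm_output)
print(RENDERERS[spec["type"]](spec))  # the UI is generated per request, not pre-compiled
```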
Multimodal reasoning in real-time: Previous voice assistants were unimodal. They understood speech but not visual context. AI phones combine voice, vision, and screen content for fluid interaction.
Point your camera at a plant and ask “what’s this?” Samsung Galaxy AI recognizes the image, searches for information, and responds conversationally. This combines computer vision, web search, and natural language generation in a single interaction that feels natural. The technical achievement: doing this fast enough on-device that latency doesn’t break the interaction flow.
This capability extends to real-time translation overlays, live caption generation, and visual search. These are multimodal capabilities that require coordinating multiple AI models simultaneously with millisecond latency requirements.
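Structurally, the point-camera-and-ask interaction is a short pipeline with a latency budget. The sketch below stubs out all three model calls; the function names and the budget figure are assumptions for illustration.

```python
# Multimodal pipeline sketch: vision -> retrieval -> language, under a latency budget.
import time

def identify(image: bytes) -> str:
    return "Monstera deliciosa"                               # stand-in for on-device vision

def search(entity: str) -> str:
    return "Tropical plant; indirect light; water weekly."    # stand-in for web search

def respond(question: str, entity: str, facts: str) -> str:
    return f"That looks like a {entity}. {facts}"             # stand-in for the language model

def camera_question(image: bytes, question: str, budget_ms: int = 1500) -> str:
    start = time.monotonic()
    entity = identify(image)
    facts = search(entity)
    answer = respond(question, entity, facts)
    elapsed_ms = (time.monotonic() - start) * 1000
    # If the pipeline blows the budget, a real system degrades gracefully
    # (e.g., answers from the vision step alone) instead of stalling the UI.
    return answer if elapsed_ms <= budget_ms else f"That looks like a {entity}."

print(camera_question(b"<camera frame>", "what's this?"))
```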
The silicon constraint
Why did this shift take so long? The constraint was silicon.
Large Language Models require massive parallel matrix multiplication. CPUs do these operations largely sequentially, which makes them slow and power-hungry for the job. Neural Processing Units (NPUs) do thousands of simple operations simultaneously. Think: one smart person solving complex problems versus a thousand people doing simple addition in parallel. For AI workloads, the thousand win.
Qualcomm’s Snapdragon 8 Elite (Gen 5) is 37% faster than its predecessor, supporting multimodal models exceeding 10 billion parameters. The numbers matter: running a 7-billion-parameter model at 4-bit precision takes roughly 4-5GB of RAM once you count the weights plus the runtime’s working memory. This is why flagship AI phones now ship with 12-16GB of RAM, not for multitasking apps but for running LLMs locally.
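The arithmetic behind that RAM figure is straightforward. The sketch below works it through with an assumed 2k-token context and a typical 7B transformer shape (32 layers, 4096 hidden size, fp16 cache); these are illustrative assumptions, not any specific chip’s or model’s spec.

```python
# Rough memory arithmetic for running a 7B-parameter model locally.
params          = 7e9
bytes_per_param = 0.5                              # 4-bit quantization
weights_gb = params * bytes_per_param / 1e9
print(f"weights:  {weights_gb:.1f} GB")            # ≈ 3.5 GB

# KV cache for an assumed 2k-token context, 32 layers, 4096 hidden size, fp16:
kv_gb = 2 * 32 * 2048 * 4096 * 2 / 1e9             # K and V, per layer, per token
print(f"KV cache: {kv_gb:.1f} GB")                 # ≈ 1.1 GB

print(f"total:    {weights_gb + kv_gb:.1f} GB")    # ≈ 4.6 GB, i.e. the 4-5GB cited above
```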
Chinese manufacturers led model compression. Xiaomi reduced models from 6 billion to 4 billion parameters while maintaining capability. Vivo’s BlueLM 3B reportedly outperforms larger 8B models on specific tasks through architecture optimization. They proved targeted optimization beats raw parameter count for phone-specific use cases.
The remaining constraint is thermal management. Continuous AI inference generates heat, throttling performance and draining batteries. This is why truly continuous AI operation remains challenging despite impressive demos.
The hybrid architecture: Pure on-device is fast and private but limited. Pure cloud is powerful but slow (network latency) and privacy-invasive. The solution: dynamic workload distribution.
Samsung: Routine tasks on-device (instant, offline, private). Complex generative tasks go to Google’s cloud via Gemini. The system learns over time which is which.
Apple: On-device first, “Private Cloud Compute” for complex tasks with cryptographic privacy guarantees such as cloud processing without data retention.
ByteDance: Aggressive cloud use for maximum capability. Deep research and real-time price comparison require cloud orchestration.
There is a strategic divide between OEMs. Apple prioritizes privacy over capability (on-device first). Chinese manufacturers prioritize capability over privacy (cloud-first). Samsung occupies the middle (hybrid with user control).
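A minimal sketch of the routing decision all three approaches share: classify the task, run it locally when possible, escalate to the cloud when needed, and let the user force local. The task categories, backends, and privacy override are invented for illustration; real routers also weigh battery, connectivity, and per-feature policy.

```python
# Hybrid on-device / cloud routing sketch.
ON_DEVICE_TASKS = {"translate", "summarize_notification", "photo_search"}

def run_on_device(task: str, payload: str) -> str:
    return f"[on-device] {task}: {payload}"      # instant, offline, private

def run_in_cloud(task: str, payload: str) -> str:
    return f"[cloud] {task}: {payload}"          # more capable, but networked

def route(task: str, payload: str, privacy_mode: bool = False) -> str:
    if task in ON_DEVICE_TASKS or privacy_mode:
        return run_on_device(task, payload)
    return run_in_cloud(task, payload)

print(route("translate", "ni hao"))                       # stays local
print(route("deep_research", "compare these 3 laptops"))  # escalates to cloud
print(route("deep_research", "...", privacy_mode=True))   # user forces local
```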
What happens next?
2025-2026: Multi-agent workflows: Current assistants handle single tasks. Next step: coordinating multiple tasks across apps. “Plan my weekend” becomes: check weather, scan calendar, suggest activities based on preferences, make reservations, add events, share plans. Current systems do 2-3 steps; 2026 systems handle complete workflows. The constraint: reliability. 30% failure rates make multi-step workflows frustrating. Expect gradual expansion rather than sudden leaps.
2026-2027: Proactive intelligence with trust: Moving from reactive to proactive. AI notices you order groceries Thursday evenings, checks pantry via receipts, suggests reorders. Or notices you’re texting about meeting up, checks calendars, suggests times, creates event after confirmation.
The critical constraint is that users resist automation without explicit permission. The balance between helpful and creepy is subtle. Honor’s “Magic Portal” demonstrates the intermediate approach: it predicts intent from immediate actions (copying an address triggers a navigation offer) rather than long-term pattern analysis. This feels helpful because it’s clearly triggered by the user’s own action.
2027-2028: Personal knowledge graphs mature: Understanding that “Vish” is your project partner, so documents mentioning “Project Lightsaber” are relevant when Vish texts. Current on-device vector databases handle hundreds of thousands of relationships; scaling to millions requires architectural improvements. Privacy implications demand on-device processing.
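A toy sketch of the idea: entities and typed relations, plus a short traversal that surfaces “Project Lightsaber” documents when Vish texts. The graph contents and the two-hop expansion are illustrative only.

```python
# Toy personal knowledge graph: entities, typed relations, short traversal.
from collections import defaultdict

edges = defaultdict(list)

def relate(a: str, rel: str, b: str) -> None:
    edges[a].append((rel, b))
    edges[b].append((rel, a))

relate("Vish", "works_on", "Project Lightsaber")
relate("Project Lightsaber", "has_document", "lightsaber_budget.xlsx")
relate("Project Lightsaber", "has_document", "timeline_update.pdf")

def context_for(entity: str, max_hops: int = 2) -> set[str]:
    """Collect everything within max_hops of the entity (breadth-first)."""
    seen, frontier = {entity}, [entity]
    for _ in range(max_hops):
        frontier = [b for node in frontier for _, b in edges[node] if b not in seen]
        seen.update(frontier)
    return seen - {entity}

# A text arrives from Vish -> surface related project documents automatically.
print(context_for("Vish"))
```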
Beyond 2028: Conversational context over weeks. Federated learning enabling your phone to learn from millions of users’ patterns without sharing data. Local content generation for specific use cases where the quality threshold is lower than for entertainment. The realistic assessment: frontier capabilities remain expensive and slow to run locally. Each generation handles more tasks on-device, more reliably, more efficiently. This is steady improvement in capability-per-watt, not revolutionary breakthroughs.
How far current capabilities go toward meeting user needs
Let’s be honest about where we actually are versus the vision:
Translation and language: 80% there. Live translation works well for common language pairs, with 1-2 second latency that’s acceptable for most conversations. Limitations remain for slang, context-dependent meaning, and less common languages. Chinese manufacturers are ahead in Asian language support. Apple and Samsung lead in European languages.
The gap is translation that preserves nuance, humor, and cultural context. Current systems handle literal meaning but miss subtlety. For business communication or casual chat, this is mostly fine. For negotiation or emotional conversations, it’s inadequate.
Content summarization: 60% there. Systems can summarize explicit information (meeting notes, articles, emails) but struggle with implicit context, understanding what’s actually important versus merely mentioned, and connecting information across different sources and timeframes.
Example: Ask for the “status of the Q2 project,” and current AI will find messages mentioning the Q2 project. It won’t necessarily understand that the budget discussion last month, the hiring conversation last week, and today’s timeline update are all related and should be synthesized into a coherent status. Humans make this contextual connection naturally; AI does it poorly.
Photo and content creation: 70% there for editing, 30% there for creation. Removing objects from photos, enhancing images and cleaning audio all work surprisingly well. Creating content from descriptions remains hit-or-miss. Sometimes you get exactly what you wanted. Often you get something vaguely related that requires multiple refinement attempts. Google’s Nano Banana is the standard setter today.
The limitation: creating something requires understanding intent, which requires conversation and iteration. Phone interfaces aren’t optimal for this back-and-forth refinement. Expect creative features to work best for quick edits (remove this, enhance that) rather than sustained creation (make me a birthday video with these photos).
Cross-app intelligence: 40% there for vision-based approaches, 20% there for API-based approaches. Two paths exist: ByteDance’s “GUI-as-API” where the AI sees the screen and clicks buttons (works with any app immediately but unreliable), versus Apple/Google’s “App Intents” where developers expose specific actions (reliable but requires developer adoption).
The tension is that vision-based agents work universally but have 30% failure rates for complex tasks. They also raise security concerns. Nothing prevents them from accidentally making purchases or deleting data. API-based agents are reliable but require thousands of developers to implement integrations, which happens slowly.
Chinese manufacturers bet on vision-based dominance. If reliability improves to 90%+, they win because it works everywhere immediately. Western manufacturers bet on API adoption: a slower rollout, but with more controlled, predictable behavior. The winner depends on whether computer vision gets reliable enough before API adoption reaches critical mass.
Four strategies, one battle
Apple (Privacy Fortress): On-device processing as competitive moat. Keep AI local, make privacy the feature. The challenge is that on-device AI is constrained by phone hardware. The recent leadership shakeup (replacing AI chief John Giannandrea) signals the initial strategy underperformed. The delayed China rollout (April 2025) creates a window for competitors. Apple’s decision to partner with Google Gemini for cloud capabilities is an admission that pure on-device has limits. “Apple Intelligence+” subscriptions are rumored to launch in 2026.
Samsung (Hybrid Pragmatist): The pitch is the best of both worlds: on-device for common tasks, cloud for complex tasks, and Google Gemini integration for capabilities Samsung won’t build. The early Galaxy AI launch defined the category but creates dependency on Google for models. Differentiation comes from hardware (foldables) and the broader ecosystem (watches, tablets, appliances). “Free until 2025” tests subscription viability with the option to monetize or bundle later.
Chinese OEMs (Speed and Scale): Aggressive on-device optimization enables AI on mid-range devices, not just flagships. Open model ecosystems, built on partnerships with ByteDance (Doubao), Alibaba (Qwen), and Baidu (ERNIE), reduce R&D costs. Government subsidies (up to $425) tilt the market toward domestic brands. The key differentiator is localization for Chinese language, dialects, and local apps that foreign brands can’t match.
Huawei’s HarmonyOS Next has severed completely from Android, building a native “Agent Framework” with “Atomic Services” the AI invokes directly. Xiaomi focuses on cross-device orchestration (“Human x Car x Home”) but faces a tension: their agent might bypass the surfaces where they display ads, threatening core revenue.
ByteDance (Platform Disruptor): The most radical strategy. They are not building hardware. Instead, they are becoming the AI layer atop Android. With 4 billion monthly users, Doubao becomes the platform while hardware partners (ZTE first, others in discussion) capture hardware margin.
The Nubia M153 demonstrates the vision: AI that operates the phone using multimodal vision to see the screen and click buttons autonomously. GUI-as-API means any app works without pre-programmed integration. User: “Book a table at that place I liked on social media.” Agent: parses screen history, opens apps, completes booking.
This threatens Super Apps (WeChat, Meituan) by becoming the primary interface. If successful, ByteDance captures platform value and commerce transaction fees while hardware commoditizes. The 30,000-unit launch sold out, validating demand for this level of automation.
The strategic battle: Who controls the agent layer controls commerce. ByteDance attempts platform capture. Apple and Google defend installed bases. Chinese OEMs hedge by building own assistants while partnering with platforms. The winner will reshape mobile computing’s next decade.
Component technology evolution: What changes the game
Looking forward, three component technology improvements will significantly expand capabilities:
Neuromorphic processors: Current NPUs are still fundamentally digital processors optimized for parallel operations. Neuromorphic chips mimic biological neural networks more directly. They process information through analog signals and sparse activation rather than dense matrix multiplication.
The advantage is a 10-100x improvement in energy efficiency for certain AI operations, enabling always-on AI that doesn’t drain battery. Current AI assistants must wake up when triggered because always-listening consumes too much power. Neuromorphic processing enables truly ambient intelligence that processes sensory input continuously, responding only when relevant.
Timeline: Early commercial neuromorphic chips in development (Intel Loihi, IBM TrueNorth), but mainstream phone integration is 2027-2029. The challenge isn’t just hardware. It’s training AI models to run efficiently on neuromorphic architectures, which requires different approaches than current deep learning.
Optical interconnects: Moving data between processor components (CPU, NPU, GPU, memory) consumes significant power. Electrical connections face bandwidth and power limits. Optical interconnects use light instead of electrons, enabling 10-100x higher bandwidth with much lower power consumption.
The impact: Enables more sophisticated on-device models by removing bandwidth bottlenecks. Currently, model size is limited by how quickly data can move through the system. Optical interconnects expand this limit considerably. Combined with better memory architectures, this could enable on-device AI quality approaching current cloud models.
Timeline: 2026-2028 for premium phones. The technology exists but needs miniaturization and cost reduction for mobile integration.
Federated learning infrastructure: Currently, on-device AI learns from your data locally, but that learning doesn’t benefit other users. Federated learning enables your phone’s AI to learn from patterns across millions of users without sharing individual data. Each phone trains locally, then shares only model improvements (not data) with the collective.
The advantage: Dramatically faster improvement in AI capabilities because learning leverages collective experience while preserving privacy. Your phone’s AI benefits from millions of others’ usage patterns without knowing anything about those specific users.
The challenge: This requires substantial infrastructure. Coordinating learning across millions of devices, aggregating improvements, detecting and filtering malicious inputs (poisoning attacks), and managing the computational load. Google has deployed early versions for keyboard predictions; expect expansion to broader AI features.
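The core mechanism is federated averaging: each device computes a model update locally, only the updates travel, and the server averages them. The sketch below is a toy version with a made-up local training step; production systems add secure aggregation, differential-privacy noise, and the poisoning defenses mentioned above.

```python
# Federated averaging sketch: data stays on each phone, only updates are shared.
def local_update(global_weights: list[float], local_data: list[float]) -> list[float]:
    """Train one step on-device (toy: nudge weights toward the local data mean)."""
    target = sum(local_data) / len(local_data)
    return [w + 0.1 * (target - w) for w in global_weights]

def federated_round(global_weights: list[float],
                    device_datasets: list[list[float]]) -> list[float]:
    updates = [local_update(global_weights, data) for data in device_datasets]
    # The server sees only the updates, never the raw per-device data.
    return [sum(ws) / len(ws) for ws in zip(*updates)]

weights = [0.0, 0.0]
devices = [[1.0, 2.0], [3.0, 5.0], [2.0, 2.0]]   # each list stays on its phone
for _ in range(5):
    weights = federated_round(weights, devices)
print(weights)   # the shared model improves without any phone revealing its data
```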
The big question
AI phones are establishing themselves as genuinely distinct. The capabilities solve real problems. Users pay premium prices ($579 billion market by 2026 despite flat shipments). But the category’s defining characteristic isn’t technology. It’s where value gets captured.
Three monetization streams are emerging: hardware premiums (one-time), subscriptions (recurring), and agentic commerce fees (transaction-based). The third is the real prize. When agents execute commerce directly, platform providers capture 2-5% of transaction value. For ByteDance with 4 billion users, this dwarfs hardware margins.
The strategic battle: Who controls the agent layer controls commerce. ByteDance attempts platform capture above the OS. Apple and Google defend installed bases. Chinese OEMs hedge with their own assistants while partnering.
The app economy disruption: Traditional 30% app store commissions are threatened as agents disintermediate apps. If agents read screens and complete purchases without users ever seeing an interface, developers lose advertising revenue. This requires resolution: will apps block agents, or will a “three-sided market” emerge where developers pay placement fees for agent compatibility?
Timeline watch points:
- Late 2025: Samsung’s subscription transition reveals whether users view AI as essential infrastructure worth paying for
- 2026: First agentic commerce transactions at scale via Google’s AP2 and Chinese innovators
- 2027-2028: App-less interface emergence. Users interact primarily with agents, icon grids begin fading
The network economics reality: AI phones drive upgrade cycles, but the traffic implications are modest. Most AI processing is on-device or generates datacenter-to-datacenter traffic, not access network demand. On-device processing eliminates access network load, and semantic compression may actually reduce traffic despite increased AI usage. Hyperscalers spend $100B+ on datacenter networks for AI. Mobile operators see marginal incremental revenue because processing happens elsewhere.
What to watch: The subscription transitions in late 2025 determine direct monetization viability. Agentic commerce infrastructure deployment in 2026 reveals transaction fee models. Platform capture attempts, especially ByteDance’s, show whether agents become the new OS layer or remain features within existing systems.
We’re establishing foundations for agentic computing that will define the next decade. The winner won’t be determined by who has the best AI features. It will be determined by who controls the commerce layer when users stop browsing and start delegating.
