On-device AI agents have hit a hard memory limit. Apple’s new architecture routes around it.



On-device AI models have stayed small because the entire set of weights must reside in DRAM, capping practical parameter counts far below those of server-side deployments. Enterprise architects evaluating agentic workloads have had to choose between capable but cloud-dependent models and limited on-device ones. Apple’s third-generation foundation models, announced at WWDC26, break that constraint by moving the model’s center of gravity out of DRAM entirely.

The AFM 3 family was developed in collaboration with Google and includes five models: two on-device and three server-based, with the server-based models running inside Apple’s Private Cloud Compute boundary. Those server-side models, including AFM 3 Cloud Pro for agentic tool use and complex reasoning, run on Nvidia GPUs in Google Cloud. The on-device architecture is Apple’s own. AFM 3 Core Advanced is a 20-billion-parameter model that stores its weights in NAND flash, not DRAM.

"Instead of forcing the entire model into DRAM, the full model is stored in flash memory," Apple’s research team wrote. "Since NAND-to-DRAM bandwidth is too slow to swap weights per token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions once per request."

How the architecture actually works

The memory wall Apple is working around is one that every on-device AI developer faces.

"You can’t fit 20B parameters into RAM at any reasonable precision," Awni Hannun, a researcher at Anthropic and former Apple researcher, posted on X. "They use quite an exotic architecture by today’s standards to make it work. A small model predicts which experts to load from NAND into RAM based on the prompt (or prompts)."

This prediction and loading mechanism has three distinct components, each driven by the hardware limitations of consumer silicon.

The full 20B weight set lives in flash, not DRAM. AFM 3 Core Advanced stores its entire parameter set in NAND flash rather than in active memory. Standard on-device deployments require the full model to fit in DRAM, which caps their parameter counts. The approach, which Apple calls Instruction-Following Pruning (IFP) and which was developed by its own researchers, treats flash as the model’s permanent home and DRAM as a working buffer for the experts a particular request requires.
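To make the DRAM constraint concrete, the footprint arithmetic can be sketched as follows. The 20B and 4B figures come from the article; the bit widths are assumptions chosen only for illustration, since Apple has not published the precision it uses.

```python
def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Bytes needed to hold `params` weights at the given precision."""
    return params * bits_per_weight / 8


GB = 1e9

TOTAL_PARAMS = 20e9       # full model held in flash, per the article
MAX_ACTIVE_PARAMS = 4e9   # upper end of the active range, per the article

# Assumed precisions, for illustration only.
for bits in (16, 8, 4):
    full = weight_bytes(TOTAL_PARAMS, bits) / GB
    active = weight_bytes(MAX_ACTIVE_PARAMS, bits) / GB
    print(f"{bits:>2}-bit: full model {full:5.1f} GB, largest active set {active:4.1f} GB")
```

Under these assumed precisions, the full weight set needs 40 GB, 20 GB, or 10 GB, while the largest active set needs 8 GB, 4 GB, or 2 GB. That gap is the reason for keeping the full model in flash and only the active experts in DRAM.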

Expert routing occurs once per request, not per token. In a conventional Mixture-of-Experts (MoE) model, the router selects different experts for each generated token, which requires moving weights between flash and DRAM continuously at generation speed. NAND-to-DRAM bandwidth cannot support that. AFM 3 Core Advanced routes once per request, selects a fixed set of experts, loads them into DRAM alongside the always-on shared experts, and generates every token from that same configuration.
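The difference in flash traffic can be illustrated with a toy simulation. This is a sketch, not Apple's implementation: the pool size, the number of routed experts, and the random stand-in router are all hypothetical, and the point is only to compare how many expert loads each routing strategy triggers.

```python
import random

NUM_EXPERTS = 16   # hypothetical size of the routed expert pool
TOP_K = 4          # hypothetical number of routed experts resident in DRAM


def pick_experts(rng: random.Random) -> frozenset[int]:
    """Stand-in for a router: returns TOP_K distinct expert ids."""
    return frozenset(rng.sample(range(NUM_EXPERTS), TOP_K))


def loads_per_token(num_tokens: int, seed: int = 0) -> int:
    """Count expert loads from flash when the router runs on every token."""
    rng = random.Random(seed)
    resident: frozenset[int] = frozenset()
    loads = 0
    for _ in range(num_tokens):
        wanted = pick_experts(rng)
        loads += len(wanted - resident)  # only missing experts are fetched
        resident = wanted                # DRAM holds just the current set
    return loads


def loads_per_request(num_tokens: int, seed: int = 0) -> int:
    """Count expert loads when the router runs once, before generation."""
    rng = random.Random(seed)
    resident = pick_experts(rng)  # single routing decision for the request
    # Every generated token reuses `resident`, so the count does not
    # depend on num_tokens.
    return len(resident)


if __name__ == "__main__":
    tokens = 200
    print("per-token routing loads:  ", loads_per_token(tokens))
    print("per-request routing loads:", loads_per_request(tokens))
```

With per-request routing, the load count stays at `TOP_K` however long the response is. With per-token routing it grows with the number of tokens, which is the traffic pattern the article says NAND-to-DRAM bandwidth cannot sustain.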

"The main difference from a typical MoE is that you do this once per request and then generate all tokens with the same experts," Hannun wrote.

Active parameters scale from 1B to 4B depending on task complexity. Instead of running a fixed model size for every request, AFM 3 Core Advanced adjusts how many parameters it activates based on what the task requires: 1 billion for simpler operations and up to 4 billion for more complex ones, all drawn from the pool of 20 billion parameters in flash.
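One way to picture a variable active budget is a mapping from a task-complexity score to a number of routed experts. Apple has not described how it sizes the active set, so everything below is an assumption made for illustration: the complexity score, the linear interpolation, and the sizes of the shared and routed experts. Only the 1B to 4B range comes from the article.

```python
SHARED_PARAMS = 1.0e9    # assumed size of the always-on shared experts
EXPERT_PARAMS = 0.25e9   # assumed size of one routed expert
MIN_ACTIVE = 1.0e9       # lower end of the active range, per the article
MAX_ACTIVE = 4.0e9       # upper end of the active range, per the article


def routed_expert_count(complexity: float) -> int:
    """Map a hypothetical complexity score in [0, 1] to a routed expert count."""
    complexity = min(max(complexity, 0.0), 1.0)  # clamp out-of-range scores
    budget = MIN_ACTIVE + complexity * (MAX_ACTIVE - MIN_ACTIVE)
    return int((budget - SHARED_PARAMS) // EXPERT_PARAMS)


def active_params(complexity: float) -> float:
    """Total active parameters: shared experts plus the selected routed experts."""
    return SHARED_PARAMS + routed_expert_count(complexity) * EXPERT_PARAMS


if __name__ == "__main__":
    for score in (0.0, 0.5, 1.0):
        print(score, routed_expert_count(score), active_params(score) / 1e9)
```

Under these assumptions, a score of 0 activates only the shared experts (1B), a score of 0.5 adds six routed experts (2.5B), and a score of 1 adds twelve (4B).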

What Apple has and has not revealed

The architecture document details the memory design and the sparse activation mechanism. It says less about practical deployment constraints.

Apple’s profiling tools expose timing, not the metrics that determine whether a model can ship. "Energy, memory bandwidth, thermals? Not in the docs," Marco Abis, founder of Ziraph, a profiler for on-device AI on Apple silicon, posted on X. "A notable gap, considering that is what decides most on-device performance."

Abis also found no mention in Apple’s documentation (the Core AI documentation, the Foundation Models documentation, or the Private Cloud Compute security post) of when a request is transparently offloaded from the device, or whether that routing is visible to the developer or user. For businesses that need to document where inference runs, this is a direct compliance issue.

Not all information is available at this time. Apple has announced that a full technical report with benchmarks will be coming this summer.

What this means for enterprise architects

Regulated industries evaluating agentic AI deployments now face a concrete architectural decision.

  • The DRAM wall for on-device agents has just moved. Enterprises evaluating agents that need to work without a round trip to the cloud now have a 20-billion-parameter on-device option to evaluate. The constraint shifts from model capability to device hardware.

  • The private/cloud boundary is now an architectural decision, not a default. Simple queries stay on the device; complex agentic tasks route to AFM 3 Cloud Pro in Private Cloud Compute. Apple does not publicly disclose when a request is offloaded or whether that routing is visible to the developer, a gap that complicates policy decisions for organizations that need to document where results are generated.

  • The agentic server tier depends on Google Cloud. AFM 3 Cloud Pro runs on Nvidia GPUs in Google Cloud. The Private Cloud Compute guarantee covers data privacy. It does not remove the Google Cloud dependency for server-side inference.
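For teams that must document where results are generated, one option is to make the tier decision explicitly in application code and log it. The sketch below assumes the application, not the platform, chooses the tier. The policy rule, the data classes, and every name in it are hypothetical and do not correspond to any Apple API. Because Apple has not documented whether platform-level offload is visible, a log like this cannot capture offloading that happens below the application.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on_device"
    PRIVATE_CLOUD = "private_cloud"


@dataclass(frozen=True)
class InferenceRecord:
    request_id: str
    tier: str          # where the application chose to run the request
    data_class: str    # hypothetical label, e.g. "regulated" or "public"
    timestamp: str     # UTC, ISO 8601


def choose_tier(data_class: str, needs_agentic_tools: bool) -> Tier:
    """Hypothetical policy: regulated data never leaves the device."""
    if data_class == "regulated":
        return Tier.ON_DEVICE
    return Tier.PRIVATE_CLOUD if needs_agentic_tools else Tier.ON_DEVICE


def audit_entry(request_id: str, data_class: str, needs_agentic_tools: bool) -> str:
    """Return a JSON audit line recording the tier chosen for a request."""
    tier = choose_tier(data_class, needs_agentic_tools)
    record = InferenceRecord(
        request_id=request_id,
        tier=tier.value,
        data_class=data_class,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record))


if __name__ == "__main__":
    print(audit_entry("req-001", "regulated", needs_agentic_tools=True))
    print(audit_entry("req-002", "public", needs_agentic_tools=True))
```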

AFM 3 Core Advanced gives enterprises a 20-billion-parameter on-device option that was not available before WWDC26. Whether it can be deployed at scale depends on answers Apple has yet to publish. Those details are expected in the summer technical report.


