Google published a research blog post on Tuesday about a new compression algorithm for artificial intelligence models. Within a few hours, memory stocks were sliding: Micron fell 3 percent, Western Digital 4.7 percent, and SanDisk 5.7 percent as investors recalculated how much physical memory the AI industry might really need.
The algorithm is called TurboQuant, and it targets one of the most expensive bottlenecks in running large language models: the key-value cache, a high-speed store that holds context information so the model does not have to recompute it for each new token it generates. As models process longer inputs, the cache grows rapidly, consuming GPU memory that could otherwise be used to serve more users or run larger models. TurboQuant compresses the cache down to just 3 bits per value from the standard 16, shrinking the memory footprint more than five-fold with no measurable loss of accuracy, according to Google's evaluations.
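To get a sense of the scale involved, here is a back-of-the-envelope sizing of a key-value cache. This is a minimal sketch assuming a hypothetical Llama-8B-style configuration (32 layers, 8 KV heads, head dimension 128, a 128,000-token context); these numbers are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits):
    # Two tensors (keys and values) per layer, one entry per token per head.
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bits / 8

# Assumed Llama-8B-style config: 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=128_000, batch=1, bits=16)
q3 = kv_cache_bytes(32, 8, 128, seq_len=128_000, batch=1, bits=3)
print(f"16-bit: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB")
# -> 16-bit: 15.6 GiB, 3-bit: 2.9 GiB
```

At that size, the cache alone for a single long-context request would swamp much of an accelerator's memory at 16 bits; the 3-bit version leaves the bulk of it free for weights and additional users.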
The paper, to be presented at ICLR 2026, is co-authored by Amir Zandieh, a research scientist at Google, and Vahab Mirrokni, a Vice President and Google Fellow, together with collaborators from Google DeepMind, KAIST, and New York University. It builds on two earlier papers from the same group: QJL, published at AAAI 2025, and PolarQuant, which will appear at AISTATS 2026.
How it works
TurboQuant's main innovation is removing the overhead that makes most compression methods less effective than their headline numbers suggest. Traditional quantization schemes shrink the data vectors, but must also store extra constants (normalization values such as per-block scales and zero-points) that the system needs to decompress the data accurately. These constants typically add one or two extra bits per number, partially undoing the compression.
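The overhead arithmetic is easy to see. A minimal sketch, assuming a typical 32-value quantization block with 16-bit scale and zero-point constants (the block and constant sizes here are illustrative assumptions, not values from the paper):

```python
# Effective bits per value once per-block constants are counted (assumed sizes).
block = 32          # values per quantization block
payload_bits = 3    # bits stored per quantized value
scale_bits = 16     # one fp16 scale per block
zero_bits = 16      # one fp16 zero-point per block

effective = payload_bits + (scale_bits + zero_bits) / block
print(f"effective bits per value: {effective}")  # -> 4.0
```

Under these assumptions, a nominal 3-bit scheme actually spends 4 bits per value, which is exactly the one-extra-bit tax the paragraph above describes.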
TurboQuant avoids this through a two-stage process. The first stage, called PolarQuant, converts the data vectors from standard Cartesian coordinates to polar coordinates, separating each vector into magnitudes and angles. Because the angular distributions follow predictable, concentrated patterns, the system can skip the expensive per-block normalization step entirely. The second stage applies QJL, a technique based on the Johnson-Lindenstrauss transform, which reduces the small residual error from the first stage to a single sign bit per dimension. The combined result is a representation that spends most of the compression budget on capturing the meaning of the original data, a minimal residual budget on error correction, and nothing on normalization constants.
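The first stage can be loosely illustrated in code. The sketch below is an illustrative toy, not Google's implementation: it splits a vector into 2-D pairs, stores each pair as a quantized angle plus an index into one magnitude codebook shared by the whole tensor, so no per-block scale constants need to be stored. The pair-wise grouping, bit widths, and quantile-based codebook are all assumptions for illustration, and the QJL sign-bit stage is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def polar_quantize(x, angle_bits=3, mag_levels=None):
    # Split into 2-D pairs and convert to polar coordinates.
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in (-pi, pi]
    # Angles live in a fixed range, so a uniform grid needs no per-block scale.
    n = 2 ** angle_bits
    theta_q = (np.round((theta + np.pi) / (2 * np.pi) * n) % n).astype(np.uint8)
    # One magnitude codebook shared by the whole tensor (an assumption here),
    # instead of a scale constant stored per block.
    if mag_levels is None:
        mag_levels = np.quantile(r, np.linspace(0, 1, 16))
    r_q = np.argmin(np.abs(r[:, None] - mag_levels[None, :]), axis=1).astype(np.uint8)
    return theta_q, r_q, mag_levels

def polar_dequantize(theta_q, r_q, mag_levels, angle_bits=3):
    n = 2 ** angle_bits
    theta = theta_q / n * 2 * np.pi - np.pi
    r = mag_levels[r_q]
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).ravel()

x = rng.standard_normal(4096)
theta_q, r_q, levels = polar_quantize(x)
x_hat = polar_dequantize(theta_q, r_q, levels)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In the real algorithm, the residual error this stage leaves behind is what the QJL sign-bit pass then compresses to one bit per dimension.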
Google tested TurboQuant on five benchmarks for long-context language models, including LongBench, Needle in a Haystack, and ZeroSCROLLS, using open-source models from the Gemma, Mistral, and Llama families. At 3 bits, TurboQuant matched or outperformed KIVI, the current standard baseline for key-value cache quantization, published at ICML 2024. On the needle-in-a-haystack retrieval tasks, which test whether a model can find a piece of data buried in a long passage, TurboQuant scored perfectly even at a compression factor of six. At 4-bit precision, the algorithm sped up attention computation eight-fold on Nvidia H100 GPUs compared with the uncompressed baseline.
What did the market hear?
The stock market reaction was swift and, according to some analysts, disproportionate. Wells Fargo analyst Andrew Rocha noted that TurboQuant directly attacks the cost curve for memory in AI systems. If widely adopted, it raises the question of how much memory capacity the industry actually needs, he said. But Rocha and others cautioned that demand for AI memory remains strong, and that compression algorithms have existed for years without fundamentally changing purchase volumes.
The concern is not unfounded, however. AI infrastructure spending is growing at an extraordinary rate: Meta alone committed up to $27 billion in a recent deal with Nebius for dedicated computing power, and Google, Microsoft, and Amazon together plan hundreds of billions of dollars in capital expenditures on data centers through 2026. A technology that cuts memory requirements several-fold does not cut costs by the same factor, because memory is only one component of data-center cost. But that ratio is shifting, and at this scale even marginal efficiency gains compound quickly.
A question of efficiency
TurboQuant arrives at a time when the AI industry is being forced to confront the economics of inference. Training a model, however large, is a one-time cost. Running it, serving millions of requests per day with acceptable latency and accuracy, is a recurring cost that determines whether AI products are financially viable at scale. The key-value cache is central to that calculation: it is the bottleneck that limits how many concurrent users a GPU can serve and how long a context window a model can practically support.
Compression techniques such as TurboQuant are part of a broader push toward cheaper inference, alongside hardware improvements such as Nvidia's Vera Rubin architecture and Google's own Ironwood TPUs. The question is whether these efficiency gains will reduce the total amount of equipment the industry buys or simply enable more ambitious deployments at roughly the same cost. The history of computing suggests the latter: when memory gets cheaper, people store more; when bandwidth increases, applications consume it.
For Google, TurboQuant also has direct commercial applications beyond language models. The blog post notes that the algorithm improves vector search, the technology that powers semantic-similarity lookups across billions of items. Google tested it against existing methods on the GloVe benchmark dataset and found it achieved superior recall without the large codebooks or per-dataset tuning that competing methods require. This matters because vector search underpins everything from Google Search to YouTube recommendations to ad targeting; in other words, it sits at the core of Google's revenue.
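Recall, the metric in question, measures what fraction of the true nearest neighbours an approximate index returns. A minimal sketch of recall@k on random data, using crude 4-bit uniform quantization of the database as a stand-in for a compressed index (neither TurboQuant itself nor the GloVe data is reproduced here; all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, n_queries, k = 32, 2000, 20, 10
db = rng.standard_normal((n, d))
queries = rng.standard_normal((n_queries, d))

def topk(scores, k):
    # Indices of the k highest-scoring database items for each query.
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(true_nn, approx_nn):
    # Fraction of true neighbours that the approximate search recovered.
    hits = [len(set(t) & set(a)) / len(t) for t, a in zip(true_nn, approx_nn)]
    return float(np.mean(hits))

exact = topk(queries @ db.T, k)

# Crude 4-bit uniform quantization of the database (a stand-in, not TurboQuant).
scale = np.abs(db).max()
db_hat = np.round(db / scale * 7).clip(-8, 7) * scale / 7
approx = topk(queries @ db_hat.T, k)

print(f"recall@{k}: {recall_at_k(exact, approx):.2f}")
```

A better quantizer shows up directly as a higher number from this metric at the same bit budget, which is the comparison Google reports against existing methods.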
The paper's contribution is real: a training-free compression method that measurably beats the current state of the art, with strong theoretical foundations and a practical implementation on production hardware. Whether it reshapes the economics of AI infrastructure or becomes just another optimization absorbed by the industry's insatiable appetite for compute is a question the market will answer in months, not hours.