Tree-fall Amid Analysts
The old philosophical puzzle asks: if a tree falls in a forest and no one is there to hear it, does it make a sound? Reverse it. If a tree falls and everyone from the financial community is suddenly watching, does it become the biggest tree fall ever? That is roughly what is happening with Google's TurboQuant paper.
Google Research published a blog post in March 2026 describing an algorithm that compresses the Key-Value (KV) cache of large language models to 3 bits per value, cutting memory consumption by at least 6x and accelerating attention logit computation by up to 8x on H100 GPUs. Both figures are measured against an unquantised 32-bit baseline rather than the compressed formats already standard in production. The strangest thing about the paper is that it has been on arXiv since April 2025. The occasion for this week's blog post, nearly a year later, is the paper's acceptance at ICLR 2026, one of the machine-learning field's most prominent conferences, which meets in late April. It is worth noting that 5,300 papers were accepted for this year's ICLR, and no awards have been given yet.
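The arithmetic behind the headline numbers is easy to reproduce. The sketch below is a back-of-the-envelope sizing under an assumed 70B-class model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128 are our illustrative figures, not the paper's): a 3-bit cache is 32/3, roughly 10.7x, smaller than the unquantised 32-bit baseline, comfortably above "at least 6x" even after allowing for per-block scale overheads.

```python
# Back-of-the-envelope KV-cache sizing. The model shape below is a
# hypothetical 70B-class configuration (an assumption for illustration,
# NOT taken from the TurboQuant paper).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_bytes_per_token(bits: float) -> float:
    # K and V each store LAYERS * KV_HEADS * HEAD_DIM values per token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bits / 8

for ctx in (8_192, 131_072, 1_000_000):
    fp32 = kv_bytes_per_token(32) * ctx / 2**30   # unquantised baseline, GiB
    q3   = kv_bytes_per_token(3)  * ctx / 2**30   # 3-bit cache, GiB
    print(f"{ctx:>9} tokens: fp32 {fp32:8.1f} GiB -> 3-bit {q3:6.1f} GiB "
          f"({fp32 / q3:.1f}x smaller)")
```

The ratio is independent of context length; what the context length changes is whether the absolute number is a rounding error or the dominant line in a GPU memory budget.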
We have read breathless “DeepSeek-moment” tweets and commentaries, though few reflect changed opinions; all have found a new thing to hang their hats and hates on. The bulls are back to the nineteenth-century Jevons paradox to argue why nothing changes their thesis, while the bears are swapping their South Sea Bubble arguments for this one to reassert that GenAI is hype.
We have faced a series of questions on the topic over the last two days, and we are writing this note to make some fundamental points about where the industry trends stand and about the dangers of “new to me is new to the world” analysis. There is always a chance that TurboQuant is more radical than the dozens of similar announcements of the last few years, which we discuss below. We do not know, and so far there is little evidence to suggest that anyone does. For those who seriously want to evaluate announcements from the GenAI research space, which we strongly support, it is important to keep the context in mind.
Late Converts and the First Rush
For the better part of three years, a significant cohort of financial analysts treated generative AI with scepticism. The framing was familiar: South Sea Bubble analogies, scaling law plateau arguments, and observations that AI had been promising revolution since the 1960s.
The events of late 2025 have made the bearish arguments unconvincing for the time being. The pace of agentic deployment has forced many commentators to engage with generative AI for what it actually is rather than for what they expected it to be.
The KV cache is an early chapter in that rebuilding of mental models. For most analysts, it is genuinely new vocabulary; for the AI engineering community, it has been a primary concern since 2022. More than 140 papers are published every day on arXiv in the cs.AI category alone. Adding machine learning (cs.LG), natural language processing (cs.CL), and computer vision, the total exceeds 500 AI-related papers per day; cs.AI alone saw 33,000 papers in 2024. We have not counted, but our GPTs claim a large share of these papers carry efficiency claims of 3x, 5x, or more. None of them generated broker notes. TurboQuant suddenly has, not because there is any real proof that it is categorically different, but because it arrived with a memorable name, from a prestigious institution, at the exact moment a large community of late adopters began looking at the topic, particularly in light of the memory-wall arguments that have been hitting them in the markets.
Of course, such a conclusion reads as excessively harsh and needlessly patronising. We are glad that many people are joining us in analysing developments in the field. We just want to ensure that we are not swayed by the first conclusions of the late converts.
The Pace of Progress: What 3x per Year Actually Means
Since our first writings in 2023, we have repeatedly described living in what we coined the “Super-Moore era”. Far too many parameters keep doubling every few months, and the following table lays them out in vivid detail. The explosive improvement spans many measurable dimensions, and it can be documented.
[Table: Super-Moore era improvements across measurable dimensions]
Read that table carefully. The inference cost column has fallen 1,000x in three years. Context windows have expanded 1,000x in the same period. Serving throughput per GPU has improved roughly 30x. These are not projections or claims from research papers. They are measured outcomes, visible in API price lists and benchmark repositories. TurboQuant's 6x memory compression claim, if fully realised in production, would be one data point in a table that already contains 1,000x entries.
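To make the section heading concrete: converting cumulative multipliers into annualised rates is one line of arithmetic. The figures below are the ones quoted above (the TurboQuant row is its single-step claim, included for scale), not new data.

```python
# Converting cumulative multipliers to annualised rates. The inputs are
# the illustrative figures quoted in the text, not new measurements.
improvements = {
    "inference cost":       (1000, 3),   # 1,000x over three years
    "context window":       (1000, 3),
    "serving throughput":   (30, 3),
    "TurboQuant KV memory": (6, 1),      # single-step claim, if realised
}
for name, (mult, years) in improvements.items():
    annual = mult ** (1 / years)         # geometric (compound) annual rate
    print(f"{name:>20}: {mult:>5}x over {years}y ~= {annual:.1f}x per year")
```

A 1,000x move over three years is 10x per year compounded; the 30x throughput gain works out to roughly 3.1x per year, which is where the “3x per year” framing comes from.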
The Claims Nobody Wrote About: 2023–2026
The following table documents major AI efficiency claims published over the past three years. The majority of these papers claimed multipliers of 2x to 28x. Most were published without institutional fanfare or catchy names. Most were not covered by the financial press. Some became foundational infrastructure; others are still research-only two years later.
[Table: major AI efficiency claims, 2023–2026]
Two entries deserve emphasis. DeepSeek MLA claimed 28x KV compression and it is fully live in production across DeepSeek's entire model family. It received approximately one paragraph of coverage in stories nominally about something else. PagedAttention claimed up to 24x serving throughput improvement. It is now the default layer in virtually every commercial LLM deployment. Neither paper generated anything approaching the TurboQuant reaction.
The KV Cache Specifically: TurboQuant Is One Entry in a Long Series
To understand where TurboQuant sits, it helps to see the full lineage of KV cache research. This is not a new problem. It has attracted sustained attention since 2023, with multiple papers per year advancing the state of the art. The table below shows the research sequence.
[Table: the KV cache research sequence, 2023–2026]
The pattern is consistent. Papers appear regularly claiming 2x to 29x improvements on KV memory or throughput. Code releases validate the core claims. Production integration takes longer than the paper implies, if it happens at all.
What Is Actually Unusual Here and What Is Not
Nothing above should be read as a dismissal of TurboQuant. The paper has genuine technical merit. Two properties are worth noting specifically.
The no-calibration property is real and useful. Every prior KV compression approach either required task-specific tuning or calibration data, or made quality tradeoffs that varied unpredictably by task. TurboQuant's data-oblivious design means it can be applied to any model in any deployment without a dataset-specific tuning step. That is a genuine engineering advantage for production integration, and it is the most likely factor to finally push lossless KV quantisation into vLLM.
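To illustrate what “data-oblivious” means in practice, here is a minimal sketch of the general recipe behind such schemes: apply a random, data-independent rotation, then round uniformly at low bit width. This is our simplification for illustration, not TurboQuant's actual algorithm; the point is only that no calibration dataset is needed at any step.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_oblivious(x: np.ndarray, bits: int = 3) -> np.ndarray:
    """Sketch of data-oblivious quantisation: rotate by a random
    orthogonal matrix chosen before seeing any data, then apply a
    uniform low-bit quantiser. Illustrative, not TurboQuant itself."""
    d = x.shape[-1]
    # Random orthogonal matrix via QR of a Gaussian -- data-independent.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    rotated = x @ q
    levels = 2 ** bits - 1                    # e.g. 7 levels at 3 bits
    lo, hi = rotated.min(), rotated.max()
    scale = (hi - lo) / levels
    codes = np.round((rotated - lo) / scale)  # integer codes in [0, levels]
    dequant = codes * scale + lo
    return dequant @ q.T                      # undo the rotation

x = rng.standard_normal(128)
xhat = quantize_oblivious(x, bits=3)
rel_err = np.linalg.norm(x - xhat) / np.linalg.norm(x)
print(f"relative reconstruction error at 3 bits: {rel_err:.3f}")
```

The rotation spreads any outlier coordinates across the whole vector, which is what lets a crude uniform quantiser behave predictably on arbitrary inputs; no tuning data enters at any point.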
The timing matters in a way it did not in 2024. Context windows are routinely 128K to 1M tokens today. At those lengths, KV cache memory is not a secondary concern; it often dominates total GPU allocation. A 6x memory saving at 1M-token context is operationally different from the same saving at 8K. The problem TurboQuant targets has grown roughly 1,000x in three years.
But the exercise this paper has prompted is different: analysts are multiplying and dividing a host of numbers, each growing or falling by a factor of 2 or 3, to argue whether demand for a product, in revenue terms, will be, say, 20% or 30%. The numbers at the technology level are moving too wildly and unreliably to serve as a basis for predicting what may or may not happen to memory demand in 2028.
Final note: it is perhaps unnecessary to say, but the above tables are based on analyses conducted by multiple GPTs. Of course, we have not read most of this research.